I am working on a complex database application written with PHP and MySQLi. For large database operations I'm using daemons (also PHP) working in the background. During these operations, which may take several minutes, I want to prevent users from accessing the affected data and show them a message instead.
I thought about creating a MySQL table and inserting a row each time a specific daemon operation takes place. That way I could always check whether an operation is in progress before letting a user access the data.
However, it is important that these records do not stay in the database if the daemon process gets terminated for any reason (kill from the console, losing the database connection, pulling the plug, etc.). I do not think MySQL transactions/rollbacks can handle this, because a commit is necessary to make the record visible to other connections, and a committed record would remain in the database if the process is terminated.
Is there a way to ensure that the records get deleted if the process gets terminated?
This is an interesting problem; I actually implemented it for a university course a few years ago.
The trick I used is to play with the transaction isolation. If your daemons create the record indicating they are in progress, but do not commit it, then you are correct in saying that the other clients will not see that record. But, if you set the isolation level for that client to READ UNCOMMITTED, you will see the record saying it's in progress - since READ UNCOMMITTED will show you the changes from other clients which are not yet committed. (The daemon would be left with the default isolation level).
You should set the client's isolation level to READ UNCOMMITTED for that daemon check only, not for its other work, as it could be very dangerous otherwise.
If the daemon crashes, the transaction gets aborted and the record goes away with it. If it finishes successfully, it can either mark the record done in the db or delete it, and then commit.
This is really nice, since if the daemon fails for any reason all the changes it made are reversed and it can be retried.
If you need more explanation or code I may be able to find my assignment somewhere :)
Transaction isolation levels reference
Note that this all requires InnoDB or any good transactional DB.
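To make the trick concrete, here is a minimal sketch with mysqli (not the original assignment code); the operation_locks table and the do_long_running_work() helper are hypothetical, and the daemon keeps the default isolation level:

// --- daemon (default isolation level) ---
mysqli_report(MYSQLI_REPORT_ERROR | MYSQLI_REPORT_STRICT);
$db = mysqli_connect("localhost", "user", "pass", "myapp");
mysqli_begin_transaction($db);
mysqli_query($db, "INSERT INTO operation_locks (operation, started_at)
                   VALUES ('rebuild_stats', NOW())"); // deliberately not committed
do_long_running_work($db); // hypothetical: the real work, in the same transaction
// On success, remove the marker before committing so it never becomes visible.
mysqli_query($db, "DELETE FROM operation_locks WHERE operation = 'rebuild_stats'");
mysqli_commit($db);
// If the daemon is killed anywhere above, MySQL aborts the open transaction
// and the marker row disappears with it.

// --- client check (separate script, its own connection) ---
$db = mysqli_connect("localhost", "user", "pass", "myapp");
mysqli_query($db, "SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED");
$busy = mysqli_num_rows(mysqli_query($db,
    "SELECT 1 FROM operation_locks WHERE operation = 'rebuild_stats'")) > 0;
mysqli_query($db, "SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ");
if ($busy) {
    exit("This data is being rebuilt right now, please try again later.");
}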
Related
SQL Server has many ways of locking a resource. I am trying to understand what makes SQL Server pick the lock level it uses. When will it use a page or table lock instead of a row lock?
Problem
I have a PHP application that uses a transaction with every HTTP request to ensure all queries are executed before a commit. One issue that is puzzling me is that when many (5+) people use the application, the app seems to hang (spinning for long periods of time)! Nothing I can think of would cause such behavior except database locks! The scenario I think is happening is that SQL Server is choosing a page or table lock over a row lock for some reason. I am trying to ensure that SQL Server does a row lock, not a page or table lock. I am using an ORM, so I can't use the ROWLOCK hint in my queries.
Is there a way for me to run a query's explain plan to see which lock level will be used?
As you can see here, there is no default granularity in lock modes.
In general the optimizer will choose the best course of action to handle this.
Could it be a case of livelock due to a long-running transaction that leads to resource starvation?
You can also check here and here for information on lock escalation, but I'd suggest not disabling it for any table.
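If you want to see empirically which granularity you are getting, one option is to inspect sys.dm_tran_locks while a transaction is open. A minimal sketch using PHP's pdo_sqlsrv driver; the connection details and the Orders table are placeholders:

$pdo = new PDO("sqlsrv:Server=localhost;Database=myapp", "user", "pass");
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->beginTransaction();
$pdo->exec("UPDATE Orders SET status = 'shipped' WHERE order_id = 42");
// resource_type shows the granularity: KEY = row, PAGE = page, OBJECT = table.
$locks = $pdo->query(
    "SELECT resource_type, request_mode, request_status
       FROM sys.dm_tran_locks
      WHERE request_session_id = @@SPID"
)->fetchAll(PDO::FETCH_ASSOC);
print_r($locks);
$pdo->rollBack(); // we only wanted to observe the locks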
When the web server receives a request for my PHP script, I presume the server creates a dedicated process to run the script. If another request for the same script comes in before the first one exits, does another process get started, or will the second request be queued in the server, waiting for the first request to finish? (Question 1)
If the former is correct, i.e. the same script can run simultaneously in different processes, then they will both try to access my database.
When I connect to the database in the script:
$DB = mysqli_connect("localhost", ...);
query it, perform some more or less lengthy calculations and update it, I don't want the contents of the database to be modified by another instance of the running script.
Question 2: Does it mean that from connecting to the database until closing it:
mysqli_close($DB);
the database is blocked for any access from other software components? If so, it effectively prevents the script instances from running concurrently.
UPDATE: @OllieJones kindly explained that the database was not blocked.
Let's consider the following scenario. The script in the first process finds an eligible user in the Users table and starts preparing data to append for that user in the Counter table. At this moment the script in the other process preempts it and deletes the user from the Users table along with the associated data in the Counter table; the first script then resumes and writes the data for a user that no longer exists. That data becomes orphaned, i.e. inaccessible.
How to prevent such a contention?
In modern web servers, there's a pool of processes (or possibly threads) handling requests from users. Concurrent requests to the same script can run concurrently. Each request-handler has its own connection to the DBMS (they're actually maintained in a pool, but that's a story for another day).
The database is not blocked while individual request-handlers are using it, unless you block it explicitly by locking a table or doing a request like SELECT ... FOR UPDATE. For more information on this deep topic, read about transactions.
Therefore, it's important to write your database queries in such a way that they won't interfere with each other. For example, if you need to learn the value of an auto-incremented column right after you insert a row, you should use LAST_INSERT_ID() or mysqli_insert_id() instead of querying the table again: another user may have inserted another row in the meantime.
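For instance, a minimal mysqli sketch (the users table here is hypothetical):

$db = mysqli_connect("localhost", "user", "pass", "myapp");
mysqli_query($db, "INSERT INTO users (name) VALUES ('alice')");
$newId = mysqli_insert_id($db); // per-connection, so concurrent inserts are safe
// Wrong: "SELECT MAX(id) FROM users" -- another request may have inserted
// a row in the meantime, handing you someone else's id.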
The system test discipline for scaled-up web sites usually involves a rigorous load test in order to shake out all this concurrency.
If you're doing a bunch of work on a particular entity, in your case a User, you use a transaction.
First you do
BEGIN
to start the transaction. Then you do
SELECT whatever FROM User WHERE user_id = <<whatever>> FOR UPDATE
to choose the user and mark that user's row as busy-being-updated. Then you do all the work you need to do to fill out various rows in various tables relating to that user.
Finally you do
COMMIT
If you messed things up, or don't want to go through with the change, you do
ROLLBACK
and all your changes will be restored to their state right before the SELECT ... FOR UPDATE.
Why does this work? Because if another client does the same SELECT ... FOR UPDATE, MySQL will delay that request until the first transaction either commits or rolls back.
If another client works with a different userid, the operations may proceed concurrently.
You need the InnoDB access method to use transactions: MyISAM doesn't support them.
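Put together with mysqli, the pattern looks roughly like this; the table and column names are placeholders for your own schema:

mysqli_report(MYSQLI_REPORT_ERROR | MYSQLI_REPORT_STRICT); // make mysqli throw
$db = mysqli_connect("localhost", "user", "pass", "myapp");
$userId = 42; // example id
mysqli_begin_transaction($db);
try {
    // Marks this user's row busy-being-updated until COMMIT or ROLLBACK.
    $stmt = mysqli_prepare($db, "SELECT name FROM User WHERE user_id = ? FOR UPDATE");
    mysqli_stmt_bind_param($stmt, "i", $userId);
    mysqli_stmt_execute($stmt);
    // ... fill out various rows in various tables relating to that user ...
    mysqli_query($db, "UPDATE Counter SET hits = hits + 1 WHERE user_id = $userId");
    mysqli_commit($db);
} catch (Exception $e) {
    mysqli_rollback($db); // undoes everything since the transaction began
    throw $e;
}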
Multiple reads can be done concurrently; a write operation will block all other operations, and a read will block all writes.
How does PHP handle multiple requests from users? Does it process them all at once or one at a time, waiting for the first request to complete before moving to the next?
Actually, I'm adding a bit of wiki functionality to a static site where users will be able to edit addresses of businesses if they find them inaccurate or if they can be improved. Only registered users may do so. When a user edits a business name, that name, along with its other occurrences, is changed in different rows in the table. I'm a little worried about what would happen if 10 users were doing this simultaneously; it'd be a real mishmash of things. So does PHP do things one at a time, in the order received, per script (update.php), or all at once?
Requests are handled in parallel by the web server (which runs the PHP script).
Updating data in the database is pretty fast, so any update will appear instantaneous, even if you need to update multiple tables.
Regarding the mishmash: for the DB, handling 10 requests within 1 second is the same as 10 requests within 10 seconds; it won't confuse them and will just execute them one after the other.
If you need to update 2 tables and absolutely need these 2 updates to run one right after the other, without being interrupted by another update query, then you can use transactions.
EDIT:
If you don't want 2 users editing the same form at the same time, you have several options to prevent them. Here are a few ideas:
You can "lock" that record for edition whenever a user opens the page to edit it, and not let other users open it for edition. You might run into a few problems if a user doesn't "unlock" the record after they are done.
You can notify in real time (with AJAX) a user that the entry they are editing was modified, just like on stack overflow when a new answer or comment was posted as you are typing.
When a user submits an edit, you can check if the record was edited between when they started editing and when they tried to submit it, and show them the new version beside their version, so that they manually "merge" the 2 updates.
There probably are more solutions but these should get you started.
It depends on which version of Apache you are using and how it is configured, but a common default configuration uses multiple workers with multiple threads to handle simultaneous requests. See http://httpd.apache.org/docs/2.2/mod/worker.html for a rundown of how this works. The end result is that your PHP scripts may together have dozens of open database connections, possibly sending several queries at the exact same time.
However, your DBMS is designed to handle this. If you are only doing simple INSERT queries, then your code doesn't need to do anything special. Your DBMS will take care of the necessary locks on its own. Row-level locking will be fastest for multiple INSERTs, so if you use MySQL, you should consider the InnoDB storage engine.
Of course, your query can always fail, whether it's due to too many database connections, a conflict on a unique index, etc. Wrap your queries in try/catch blocks to handle this case.
If you have other application-layer concerns about concurrency, such as one user overwriting another user's changes, then you will need to handle these in the PHP script. One way to handle this is to use revision numbers stored along with your data, and refusing to execute the query if the revision number has changed, but how you handle it all depends on your application.
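A minimal sketch of the revision-number idea with mysqli; the business table and its revision column are hypothetical:

$db = mysqli_connect("localhost", "user", "pass", "myapp");
$newName = "Joe's Diner"; // example user input
$id = 7;                  // row being edited
$revisionSeen = 3;        // revision loaded when the user opened the form
$stmt = mysqli_prepare($db, "UPDATE business SET name = ?, revision = revision + 1
                              WHERE id = ? AND revision = ?");
mysqli_stmt_bind_param($stmt, "sii", $newName, $id, $revisionSeen);
mysqli_stmt_execute($stmt);
if (mysqli_stmt_affected_rows($stmt) === 0) {
    // Someone else saved first: reload the row and let the user merge by hand.
}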
I have done extensive research about MySQL transactions via the PHP PDO interface. I am still a little fuzzy about the actual background workings of the transaction methods. Specifically, I need to know if there is any reason I should avoid wrapping all my queries (SELECTs included) in a single transaction spanning from the beginning of the script to the end, handling any error in the transaction and rolling back if need be, of course.
I want to know if there is any locking going on during a transaction and if so, is it row level locking because it is InnoDB?
Don't do that.
The reason for this is that transactions take advantage of MVCC (multi-version concurrency control), a mechanism by which every updated piece of data is in fact not updated in place but written somewhere else.
MVCC implies allocating memory and/or storage space to accumulate all of the changes you send without committing them to disk until you issue a COMMIT.
That means that while your entire script runs, all changes are stored up until the script ends. And all of the records that you try to change during the transaction are marked as "work in progress" so that other processes/threads can know that this data will soon be invalidated.
Having certain pieces of data marked as "work in progress" for the entire length of the script means that any other concurrent update will see the flag and say "I have to wait until this finishes, so I get the most recent data".
This includes SELECTs, depending on the isolation level. Selecting data that is marked as "work in progress" may not be what you want, because some tables that you join may contain already-updated data while other tables are not updated yet, resulting in a dirty read.
Transactionality and atomicity of operations are desirable but costly. Use them where they are needed. Yes, that means more work for you to figure out where race conditions can happen, and even when they do occur, you have to decide whether they are really critical or whether "some" data loss/mixing is acceptable.
Would you like your logs, visit counters and other statistics to drag down the speed of your entire site? Or is the quality of that information worth sacrificing for speed (as long as it's not an analytics suite, you can afford the occasional collision)?
Would you like a seat reservation application to misfire and allow more users to grab a seat even after the seat count was 0? Of course not: here you want to leverage transactions and isolation levels to ensure that never happens.
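For that seat-reservation case, here is a sketch with mysqli and SELECT ... FOR UPDATE; the events table is hypothetical:

mysqli_report(MYSQLI_REPORT_ERROR | MYSQLI_REPORT_STRICT);
$db = mysqli_connect("localhost", "user", "pass", "myapp");
mysqli_begin_transaction($db);
try {
    // Lock the row so two concurrent requests can't both see seats_left = 1.
    $row = mysqli_fetch_assoc(mysqli_query($db,
        "SELECT seats_left FROM events WHERE id = 7 FOR UPDATE"));
    if ((int)$row['seats_left'] > 0) {
        mysqli_query($db, "UPDATE events SET seats_left = seats_left - 1 WHERE id = 7");
        mysqli_commit($db);
    } else {
        mysqli_rollback($db); // sold out, nothing to change
    }
} catch (Exception $e) {
    mysqli_rollback($db);
    throw $e;
}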
We have this PHP application which selects a row from the database, works on it (calls an external API which uses a webservice), and then inserts a new register based on the work done. There's an AJAX display which informs the user of how many registers have been processed.
The data is mostly text, so it's rather heavy data.
The process handles thousands of registers at a time. The user can choose how many registers to start working on. The data is obtained from one table, where the registers are marked as "done" once processed. There is no WHERE condition, except the optional "WHERE date BETWEEN date1 AND date2".
We had an argument over which approach is better:
Select one register, work on it, and insert the new data.
Select all of the registers, work with them in memory, and insert them into the database after all the work is done.
Which approach do you consider the most efficient one for a web environment with PHP and PostgreSQL? Why?
It really depends on how much you care about your data (seriously):
Does reliability matter in this case? If the process dies, can you just re-process everything? Or can't you?
Typically when calling a remote web service, you don't want to be calling it twice for the same data item. Perhaps there are side effects (like credit card charges), or maybe it is not a free API...
Anyway, if you don't care about potential duplicate processing, then take the batch approach. It's easy, simple, and fast.
But if you do care about duplicate processing, then do this:
SELECT 1 record from the table FOR UPDATE (i.e. lock it in a transaction)
UPDATE that record with a status of "Processing"
Commit that transaction
And then
Process the record
Update the record contents, AND
SET the status to "Complete", or "Error" in case of errors.
You can run this code concurrently without fear of it running over itself. You will be able to have confidence that the same record will not be processed twice.
You will also be able to see any records that "didn't make it", because their status will still be "Processing", as well as any that hit errors.
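A sketch of those steps with PDO and PostgreSQL; the registers table, its columns and the call_external_api() helper are placeholders:

$pdo = new PDO("pgsql:host=localhost;dbname=myapp", "user", "pass");
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Claim one record inside a short transaction (steps 1-3).
$pdo->beginTransaction();
$row = $pdo->query("SELECT id, payload FROM registers
                     WHERE status = 'pending'
                     ORDER BY id LIMIT 1 FOR UPDATE")->fetch(PDO::FETCH_ASSOC);
if ($row === false) { $pdo->rollBack(); exit("nothing to do"); }
$pdo->prepare("UPDATE registers SET status = 'Processing' WHERE id = ?")
    ->execute([$row['id']]);
$pdo->commit(); // the claim is now visible to every other worker

// Do the slow work outside the transaction, then record the outcome.
try {
    $result = call_external_api($row['payload']); // hypothetical helper
    $pdo->prepare("UPDATE registers SET result = ?, status = 'Complete' WHERE id = ?")
        ->execute([$result, $row['id']]);
} catch (Exception $e) {
    $pdo->prepare("UPDATE registers SET status = 'Error' WHERE id = ?")
        ->execute([$row['id']]);
}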
If the data is heavy and so is the load, and considering the application is not real-time dependent, the best approach is most definitely to get all the needed data, work on all of it, and then put it back.
Efficiency-wise, regardless of language: if you fetch single items and work on them individually, you are probably opening and closing a database connection for each one. If you have thousands of items, you will open and close thousands of connections, and the overhead of that far outweighs the overhead of returning all of the items at once and working on them.
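In other words, keep a single connection open and batch the work; a sketch with PDO, where the registers table and the process() helper are placeholders:

$pdo = new PDO("pgsql:host=localhost;dbname=myapp", "user", "pass");
$rows = $pdo->query("SELECT id, payload FROM registers WHERE status = 'pending'")
            ->fetchAll(PDO::FETCH_ASSOC);
$done = $pdo->prepare("UPDATE registers SET status = 'done' WHERE id = ?");
foreach ($rows as $row) {
    process($row['payload']);     // hypothetical work function
    $done->execute([$row['id']]); // one reused connection, one reused statement
}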