Best way to handle concurrency issues - php

I have a LAPP (Linux, Apache, PostgreSQL and PHP) environment, but the question is pretty much the same for Postgres or MySQL.
I have a CMS app I developed that handles clients, documents (estimates, invoices, etc.) and other data, structured in one Postgres DB with many schemas (one for each of our customers using the app); let's assume around 200 schemas, each of them used concurrently by 15 people (avg).
EDIT: I do have a timestamp field named last_update on every table, and a trigger that updates the timestamp every time the row is updated.
The situation is:
Users Foo and Bar are editing document 0001, using a form with all the document details.
Foo changes the shipment details, for example.
Bar changes the phone numbers, and some items in the document.
Foo presses the 'Save' button; the app updates the db.
Bar presses the 'Save' button after Foo, resending the form with the old shipment details.
In the database, Foo's changes have been lost.
The situation I want to have:
Users Foo, Bar, John, Mary, Paoul are editing document 0001, using a form with all the document details.
Foo changes the shipment details, for example.
Bar and the others change something else.
Foo presses the 'Save' button; the app updates the db.
Bar and the others get an alert: 'Warning! This document has been changed by someone else. Click here to load the actual data.'
I've considered using ajax to do this: simply using a hidden field with the id of the document and the last-updated timestamp, check every 5 seconds whether the last-updated time is still the same and do nothing, else show the alert dialog box.
So, the page check-last-update.php should look something like:
<?php
// [connect to db, postgres or mysql]
$documentId     = isset($_POST['document-id']) ? (int)$_POST['document-id'] : 0;
$lastUpdateTime = isset($_POST['last-update-time']) ? $_POST['last-update-time'] : 0;

// use a parameterized query (in real life the data is sanitized as well)
$qr = pg_query_params(
    "SELECT last_update_time
       FROM documents
      WHERE id = $1",
    array($documentId)
);
$row = pg_fetch_assoc($qr);

if ($row['last_update_time'] > $lastUpdateTime) {
    // someone else updated the document since I opened it!
    echo 'reload';
} else {
    echo 'ok';
}
?>
But I don't like stressing the db every 5 seconds for every user that has one (or more...) documents open.
So, what could be another efficient solution without nuking the db?
I thought of using files, creating for example an empty txt file for each document, and every time the document is updated I 'touch' the file, updating its 'last modified time' as well... but I guess this would be slower than the db and give problems when I have many users editing the same document.
If someone has a better idea or any suggestion, please describe it in detail!
* - - - - - UPDATE - - - - - *
I've definitely chosen NOT to hit the db to check the 'last update timestamp'; never mind that the query would be pretty fast, the (main) database server has other tasks to fulfill and I don't like the idea of increasing its load for this.
So, I'm taking this approach:
Every time a document is updated by someone, I must record the new timestamp outside the db environment, i.e. without asking the db. My ideas are:
File system: for each document I create an empty txt file named after the id of the document; every time the document is updated, I 'touch' the file. I'm expecting to have thousands of those empty files.
APC, the PHP cache: this would probably be a more flexible way than the first one, but I'm wondering whether keeping thousands and thousands of entries permanently in APC won't slow down PHP execution itself, or consume the server's memory. I'm a little bit afraid to choose this way.
Another db, SQLite or MySQL (which are faster and lighter with simple db structures), used to store just the document IDs and timestamps.
Whichever way I choose (files, APC, sub-db), I'm seriously thinking of using another web server (lighttpd?) on a sub-domain to handle all those... long-polling requests.
YET ANOTHER EDIT:
The file approach wouldn't work.
APC can be the solution.
Hitting the DB can be the solution too, creating a table just to handle the timestamps (with only two columns, document_id and last_update_timestamp) that needs to be as fast and light as possible.
Long polling: that's the way I'll choose, using lighttpd behind apache to serve static files (images, css, js, etc.) and just this type of long-polling request; this will lighten the apache2 load, especially for the polling.
Apache will proxy all those requests to lighttpd.
Now I only have to decide between the db solution and the APC solution...
P.S.: thanks to all who have already answered, you have been really helpful!

I agree that I probably wouldn't hit the database for this. I suppose I would use the APC cache (or some other in-memory cache) to maintain this information. What you are describing is clearly optimistic locking at the detailed record level. The higher the level in the database structure, the less you need to deal with. It sounds like you want to check multiple tables within a structure.
I would maintain a cache (in APC) of the IDs and the timestamps of the last update time, keyed by table name. So, for example, I might have an array of table names where each entry is keyed by ID and the actual value is the last updated timestamp. There are probably many ways to set this up with arrays or other structures, but you get the idea. I would probably add a timeout to the cache so that entries are removed after a certain period of time - i.e., I wouldn't want the cache to grow, and I'd assume that 1-day-old entries aren't useful anymore.
With this architecture you would need to do the following (in addition to setting up APC):
On any update to any (applicable) table, update the APC cache entry with the new timestamp.
In the ajax call, only go as far "back" as PHP (to check the entry in the APC cache) rather than all the way "back" to the database.
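A minimal sketch of that cache, assuming the APC extension is available (the key format, TTL and helper names are mine, not part of the original answer):

<?php
// Hypothetical helpers illustrating the APC approach described above.
define('DOC_CACHE_TTL', 86400); // drop entries after one day

function cache_key($table, $id) {
    return "last_update:{$table}:{$id}";
}

// Call this from whatever code path performs the UPDATE.
function mark_updated($table, $id) {
    apc_store(cache_key($table, $id), time(), DOC_CACHE_TTL);
}

// Called by the ajax polling endpoint instead of querying the database.
function has_changed($table, $id, $clientTimestamp) {
    $cached = apc_fetch(cache_key($table, $id), $success);
    if (!$success) {
        return false; // no cache entry: assume unchanged (or fall back to the db)
    }
    return $cached > $clientTimestamp;
}
?>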

I think you can use a condition in the UPDATE statement like WHERE ID=? AND LAST_UPDATE=?.
The idea is that you will only succeed in updating when you are the last one reading that row. If someone else has committed something, you will fail, and once you know you've failed, you can query the changes.
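As a rough sketch of what that could look like from PHP with Postgres (the table and column names below are assumptions taken from the question):

<?php
// Optimistic update: only succeeds if the row still carries the timestamp we read.
$result = pg_query_params(
    "UPDATE documents
        SET shipment_details = $1
      WHERE id = $2
        AND last_update_time = $3",
    array($shipmentDetails, $documentId, $lastUpdateTimeWhenLoaded)
);

if (pg_affected_rows($result) === 0) {
    // someone else updated the row since we read it: reload and warn the user
}
?>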

Hibernate uses a version field to do that. Give every table such a field and use a trigger to increment it on every update. When storing an update, compare the current version with the version when the data was read earlier. If those don't match, throw an exception. Use transactions to make the check-and-update atomic.
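For Postgres, a rough (untested) sketch of that version column and trigger might look like this; the table name and trigger body are assumptions, not the poster's actual schema:

<?php
// One-off setup: add a version column and a trigger that bumps it on every update.
pg_query('
    ALTER TABLE documents ADD COLUMN version integer NOT NULL DEFAULT 0;

    CREATE OR REPLACE FUNCTION bump_version() RETURNS trigger AS $body$
    BEGIN
        NEW.version := OLD.version + 1;
        RETURN NEW;
    END;
    $body$ LANGUAGE plpgsql;

    CREATE TRIGGER documents_bump_version
        BEFORE UPDATE ON documents
        FOR EACH ROW EXECUTE PROCEDURE bump_version();
');
?>

The save itself would then be an UPDATE ... WHERE id = ... AND version = ... inside a transaction, failing (zero affected rows) when the versions no longer match, much like the previous answer.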

You will need some type of version stamp field for each record. What it is doesn't matter as long as you can guarantee that making any change to a record will result in that version stamp being different. Best practice is to then check and make sure the loaded record's version stamp is the same as the version stamp in the DB when the user clicks save, and if it's different handle it.
How you handle it is up to you. At the very least you'd want to offer to reload from the DB so the user can verify that they still want to save. One up from that would be to attempt to merge their changes into the new DB record and then ask them to verify that the merge worked correctly.
If you want to poll periodically, any DB capable of handling your system should be able to take the poll load. 10 users polling once every 5 seconds is 2 transactions per second. This is a trivial load, and should be no problem at all. To keep the average load close to the actual load, just jitter the polling interval slightly (instead of polling exactly every 5 seconds, poll every 4-6 seconds, for example).

Donnie's answer (polling) is probably your best option - simple and it works. It'll cover almost every case (it's unlikely a simple PK lookup would hurt performance, even on a very popular site).
For completeness, and if you wanted to avoid polling, you can use a push model. There are various ways described in the Wikipedia article. If you can maintain a write-through cache (every time you update the record, you update the cache), then you can almost completely eliminate the database load.
Don't use a timestamp "last_updated" column, though. Edits within the same second aren't unheard of. You could get away with it if you add extra information (the server that did the update, remote address, port, etc.) to ensure that, if two requests came in during the same second to the same server, you could detect the difference. If you need that precision, though, you might as well use a unique revision field (it doesn't necessarily have to be an incrementing integer, just unique within that record's lifespan).
Someone mentioned persistent connections - this would reduce the setup cost of the polling queries (every connection consumes resources on the database and host machine, naturally). You would keep a single connection (or as few as possible) open all the time (or as long as possible) and use that (in combination with caching and memoization, if desired).
Finally, there are SQL statements that allow you to add a condition on UPDATE or INSERT. My SQL is really rusty, but I think it's something like UPDATE ... WHERE .... To match this level of protection, you would have to do your own row locking prior to sending the update (and all the error handling and cleanup that might entail). It's unlikely you'd need this; I'm just mentioning it for completeness.
Edit:
Your solution sounds fine (cache timestamps, proxy polling requests to another server). The only change I'd make is to update the cached timestamps on every save. This will keep the cache fresher. I'd also check the timestamp directly from the db when saving, to prevent a save sneaking in due to stale cache data.
If you use APC for caching, then a second HTTP server doesn't make sense - you'd have to run it on the same machine (APC uses shared memory). The same physical machine would be doing the work, but with the additional overhead of a second HTTP server. If you want to offload the polling requests to a second server (lighttpd, in your case), then it would be better to set up lighttpd in front of Apache on a second physical machine and use a shared caching server (memcache) so that the lighttpd server can read the cached timestamps and Apache can update them. The rationale for putting lighttpd in front of Apache is, if most requests are polling requests, to avoid the heavier-weight Apache process usage.
You probably don't need a second server at all, really. Apache should be able to handle the additional requests. If it can't, then I'd revisit your configuration (specifically the directives that control how many worker processes you run and how many requests they are allowed to handle before being killed).

Your approach of querying the database is the best one. If you do it every 5 seconds and you have 15 concurrent users then you're looking at ~3 queries a second. It should be a very small query too, returning only one row of data. If your database can't handle 3 transactions a second then you might have to look at a better database because 3 queries/second is nothing.
Timestamp the records in the table so you can quickly see if anything has changed without having to diff each field.

This is slightly off topic, but you can use the PEAR package (or PECL package, I forget which) xdiff to send back good user guidance when you do get a collision.

First off, only update the fields that have changed when writing to the database; this will decrease database load.
Second, query the timestamp of the last update; if you have an older timestamp than the current version in the database, throw the warning to the client.
Third, somehow push this information to the client, through some kind of persistent connection with the server, enabling a concurrent two-way connection.

Polling is rarely a nice solution.
You could do the timestamp check only when the user (with the open document) is doing something active with the document, like scrolling, moving the mouse over it or starting to edit. Then the user gets an alert if the document has been changed.
.....
I know it was not what you asked for but... why not an edit-singleton?
The singleton could be a userID column in the document-table.
If a user wants to edit the document, the document is locked for edit by other users.
Or have edit-singletons on the individual fields/groups of information.
Only one user can edit the document at a time. If another user has the document open and wants to edit it, a single timestamp check reveals that the document has been altered, and it is reloaded.
With a singleton there is no polling and only one timestamp check when the user "touches" and/or wants to edit the document.
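As a rough sketch of taking such a lock (the locked_by column and the query shape are my assumptions, not part of the original system):

<?php
// Try to take the edit lock: succeeds only if nobody holds it, or we already do.
$result = pg_query_params(
    "UPDATE documents
        SET locked_by = $1
      WHERE id = $2
        AND (locked_by IS NULL OR locked_by = $1)",
    array($currentUserId, $documentId)
);

if (pg_affected_rows($result) === 0) {
    // another user is editing this document: open it read-only and show a notice
}
?>

You would still need a way to release (or expire) the lock if the user walks away without saving.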
But perhaps a singleton mechanism doesn't fit your system.
Regards
Sigersted

Ahhh, I thought it was easier.
So, let's get to the point: I have a generic database (pgsql or mysql, doesn't matter) that contains many generic objects.
I have $x exact copies of this database (actually $x = 200, but it's growing; hopefully it will reach 1000 soon), and for each of them up to 20 (avg 10) users for 9 hours a day.
If one of those users is viewing a record, any record, I must warn him if someone edits the same record.
Let's say Foo is viewing document 0001 and steps out for a coffee; Bar opens and edits the same document; when Foo comes back he must see a 'Warning, someone else edited this document! Click here to refresh the page.'
That's all I need atm; I'll probably extend this later, adding a way to see the changes and roll back, but that is not the point.
Some of you suggested checking the 'last update' timestamp only when Foo tries to save the document; that can be a solution too, but I need something near real-time (~10 sec delay).
Long polling: not a nice way, but it seems to be the only one.
So, what I've done:
Installed lighttpd on my machine (and php5 as fastcgi);
Loaded apache2's proxy modules (all of them, or a 403 error will hit you);
Changed the lighttpd port from 80 (which is used by apache2) to 81;
Configured apache2 to proxy requests from mydomain.com/polling/* to polling.mydomain.com (served by lighttpd).
Now I have another sub HTTP service that I'll use both for polling and for loading static content (images, etc.), in order to reduce apache2's load.
Because I don't want to nuke the database for the timestamp check, I've tried some cache systems (that can be called from PHP):
APC: quite simple to install and manage, very lightweight and fast; this would be my first choice... if only the cache were sharable between two cgi processes (I need to store a value in the cache from apache2's PHP process and read it from lighttpd's PHP process).
Memcached: around 4-5 times slower than APC, but it runs as a single process that can be reached from everywhere in my environment. I'll go with this one, atm (even if it is slower, my use of it is relatively simple).
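A minimal sketch of that shared timestamp store, using the pecl memcache extension (key names, host and port are assumptions):

<?php
// Shared between apache2's PHP (which handles saves) and lighttpd's PHP (which handles polling).
$memcache = new Memcache();
$memcache->connect('127.0.0.1', 11211);

// Called from the save handler right after the document UPDATE.
function mark_document_updated($memcache, $documentId) {
    $memcache->set('doc_last_update_' . $documentId, time());
}

// Called from the long-polling endpoint instead of hitting the main database.
function document_changed_since($memcache, $documentId, $clientTimestamp) {
    $ts = $memcache->get('doc_last_update_' . $documentId);
    return $ts !== false && $ts > $clientTimestamp;
}
?>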
Now I just have to try this system by loading some test data to see how it behaves 'under pressure' and optimize it.
I suppose this environment will also work for other long-polling situations (chat?).
Thanks to everyone who lent me an ear!

I suggest: when you first query the record that might be changed, hang onto a local copy. When "updating", compare the copy in the locked table/row against your copy, and if it's changed, kick it back to the user.

Related

Which caching technique to be used for apache php

Currently I'm using shared hosting for my site. But we currently have about 1,100,000 rows in one of the tables, so it's taking a lot of time to load the web page. So we want to implement a database caching technique like APC or memcache for our site. But on shared hosting we don't have those facilities available; we only have eAccelerator, and eAccelerator does not cache db calls, if I'm not wrong. So considering all these points we want to move to a VPS, and in that case, which database caching technique should we use, APC or memcache, to decrease the page load time? Please advise on the VPS and the better caching technique of the two.
We have a similar website and we use APC.
APC will cache the opcode as well as the HTML that is generated. This helps to avoid unnecessary regeneration of the page.
You should also enable caching on MySQL to cache the results of your queries.
I had a task where I needed to fetch rows from a database table that had more than 100,000 records; it was a scrollable page. So what I did was fetch the first 50 records and cache the next 50 in the first call, and on scroll-down events I wrote an ajax request to check if the data was available in the cache; if not, I fetched it from the database and also cached the next 50. It worked pretty well and solved the inconvenient load time.
If you have a similar scenario you might benefit from this approach.
P.S.: I used memcache.
From your comment I take it you're doing a LIKE %..% query and want to paginate the result. First of all, investigate whether FULLTEXT indices are an option for you, as they should perform better. If that's not an option, you can add a simple cache like so:
Treat each unique search term as an id, i.e. if in your URL you have ..?search=foobar, then "foobar" is the id of the result set. Keep that in all your links, e.g. ..?search=foobar&page=2.
If the result set does not yet exist (see below), create it:
Query the database with your slow query.
Get all the results into an array. Don't overdo it, you don't want to be storing hundreds of megabytes.
Create a unique filename per query, e.g. sha1($query), or maybe sha1(strtolower($query)).
serialize the data and store it in the file.
Get the data from the file, unserialize it, display the portion of the array corresponding to the requested page.
Occasionally, delete old cached results. You can do that with something like if (rand(0, 100) == 1) .., which will run the cleanup job every 100 queries on average. Strike a balance between server load and data freshness. Cache invalidation is a topic whole books can be written about, BTW.
That's a simple poor man's cache implementation. It's not great, but if you have absolutely nothing else to work with, it's better than running slow queries over and over.
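A sketch of that poor man's cache, under the assumption of a MySQL LIKE query and a 50-row page size (the table, columns and paths are placeholders):

<?php
// Poor man's result cache along the lines described above.
$search    = isset($_GET['search']) ? $_GET['search'] : '';
$page      = isset($_GET['page']) ? (int)$_GET['page'] : 1;
$perPage   = 50;
$cacheFile = '/tmp/search_cache/' . sha1(strtolower($search));

if (!file_exists($cacheFile)) {
    // the slow query runs only once per unique search term
    $result = mysql_query("SELECT id, title FROM posts WHERE title LIKE '%"
        . mysql_real_escape_string($search) . "%'");
    $rows = array();
    while ($row = mysql_fetch_assoc($result)) {
        $rows[] = $row;
    }
    file_put_contents($cacheFile, serialize($rows));
}

$rows     = unserialize(file_get_contents($cacheFile));
$pageRows = array_slice($rows, ($page - 1) * $perPage, $perPage);

// occasionally clean out cached results older than a day (about 1 in 100 requests)
if (rand(0, 100) == 1) {
    foreach (glob('/tmp/search_cache/*') as $f) {
        if (filemtime($f) < time() - 86400) {
            unlink($f);
        }
    }
}
?>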
APC is the Alternative PHP Cache and works only with PHP, whereas memcached works independently with any language.

How do PHP and Apache handle multiple requests?

How does PHP handle multiple requests from users? Does it process them all at once, or one at a time, waiting for the first request to complete and then moving to the next?
Actually, I'm adding a bit of wiki to a static site where users will be able to edit addresses of businesses if they find them inaccurate or if they can be improved. Only registered users may do so. When a user edits a business name, that name along with its other occurrences is changed in different rows of the table. I'm a little worried about what would happen if 10 users were doing this simultaneously. It'd be a real mishmash of things. So does PHP do things one at a time, in the order received, per script (update.php), or all at once?
Requests are handled in parallel by the web server (which runs the PHP script).
Updating data in the database is pretty fast, so any update will appear instantaneous, even if you need to update multiple tables.
Regarding the mishmash: for the DB, handling 10 requests within 1 second is the same as 10 requests within 10 seconds; it won't confuse them and will just execute them one after the other.
If you need to update 2 tables and absolutely need these 2 updates to run back to back without being interrupted by another update query, then you can use transactions.
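For example, a minimal sketch with the old mysql_* API, assuming a transactional engine such as InnoDB and placeholder table names:

<?php
// Run the two related updates so they either both commit or both roll back.
mysql_query("START TRANSACTION");
$ok1 = mysql_query("UPDATE businesses SET name = 'New Name' WHERE id = 42");
$ok2 = mysql_query("UPDATE listings SET business_name = 'New Name' WHERE business_id = 42");

if ($ok1 && $ok2) {
    mysql_query("COMMIT");
} else {
    mysql_query("ROLLBACK");
}
?>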
EDIT:
If you don't want 2 users editing the same form at the same time, you have several options to prevent them. Here are a few ideas:
You can "lock" that record for edition whenever a user opens the page to edit it, and not let other users open it for edition. You might run into a few problems if a user doesn't "unlock" the record after they are done.
You can notify in real time (with AJAX) a user that the entry they are editing was modified, just like on stack overflow when a new answer or comment was posted as you are typing.
When a user submits an edit, you can check if the record was edited between when they started editing and when they tried to submit it, and show them the new version beside their version, so that they manually "merge" the 2 updates.
There probably are more solutions but these should get you started.
It depends on which version of Apache you are using and how it is configured, but a common default configuration uses multiple workers with multiple threads to handle simultaneous requests. See http://httpd.apache.org/docs/2.2/mod/worker.html for a rundown of how this works. The end result is that your PHP scripts may together have dozens of open database connections, possibly sending several queries at the exact same time.
However, your DBMS is designed to handle this. If you are only doing simple INSERT queries, then your code doesn't need to do anything special. Your DBMS will take care of the necessary locks on its own. Row-level locking will be fastest for multiple INSERTs, so if you use MySQL, you should consider the InnoDB storage engine.
Of course, your query can always fail whether it's due to too many database connections, a conflict on a unique index, etc. Wrap your queries in try catch blocks to handle this case.
If you have other application-layer concerns about concurrency, such as one user overwriting another user's changes, then you will need to handle these in the PHP script. One way to handle this is to use revision numbers stored along with your data, and refusing to execute the query if the revision number has changed, but how you handle it all depends on your application.

Getting all data once for future use

Well, this is kind of a question of how to design a website which uses fewer resources than normal websites. Mobile-optimized as well.
Here it goes: I was about to display a specific overview of e.g. 5 posts (from e.g. a blog). Then if I clicked, for example, on the first post, I'd load this post in a new window. But instead of connecting to the database again and getting this specific post by its id, I'd just look up that post (in PHP) in the array of 5 posts that I created earlier, when I fetched the website for the first time.
Would it save data to download? Because PHP works server-side as well, that's why I'm not sure.
Ok, I'll explain again:
Method 1:
User connects to my website
5 posts are displayed & saved to an array (with all their data)
User clicks on the first post and expects more information about it.
My program looks up the post in my array and displays it.
Method 2:
User connects to my website
5 posts are displayed
User clicks on the first post and expects more information about it.
My program connects to MySQL again and fetches the post from the server.
First off, this sounds like a case of premature optimization. I would not start caching anything outside of the database until measurements prove that it's a wise thing to do. Caching takes your focus away from the core task at hand, and introduces complexity.
If you do want to keep DB results in memory, just using an array allocated in a PHP-processed HTTP request will not be sufficient. Once the page is processed, memory allocated at that scope is no longer available.
You could certainly put the results in SESSION scope. The advantage of saving some DB results in the SESSION is that you avoid DB round trips. Disadvantages include the increased complexity of programming the solution, use of memory in the web server for data that may never be accessed, and increased initial load on the DB to retrieve the extra pages that may or may not ever be requested by the user.
If DB performance, after measurement, really is causing you to miss your performance objectives you can use a well-proven caching system such as memcached to keep frequently accessed data in the web server's (or dedicated cache server's) memory.
Final note: You say
PHP works server-side as well
That's not accurate. PHP works server-side only.
Have you thought of saving the posts in divs, and only making them visible when the user clicks somewhere? Here's how to do that.
Put some sort of cache between your code and the database.
So your code will look something like:
if (isPostInCache($postId)) {
    $post = loadPostFromCache($postId);
} else {
    $post = loadPostFromDatabase($postId);
}
Go for some caching system; the web is full of them. You can use memcached or static caching you make yourself (i.e. save posts in txt files on the server).
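If memcached were available, those placeholder functions might look roughly like this (the key names, the 10-minute TTL and the posts table are assumptions):

<?php
// Rough memcached-backed versions of the placeholder functions above.
$memcache = new Memcache();
$memcache->connect('127.0.0.1', 11211);

function isPostInCache($postId) {
    global $memcache;
    return $memcache->get('post_' . $postId) !== false;
}

function loadPostFromCache($postId) {
    global $memcache;
    return $memcache->get('post_' . $postId);
}

function loadPostFromDatabase($postId) {
    global $memcache;
    $result = mysql_query("SELECT * FROM posts WHERE id = " . (int)$postId);
    $post   = mysql_fetch_assoc($result);
    $memcache->set('post_' . $postId, $post, 0, 600); // cache for 10 minutes
    return $post;
}
?>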
To me, this is a little more inefficient than making a 2nd call to the database and here is why.
The first query should only be pulling the fields you want, like title, author, date. The content of the post may be a heavy query, so I'd exclude that (you can pull a teaser if you'd like).
Then, if the user wants the details of the post, I would query for the content with an indexed key column.
That way you're not pulling content for 5 posts that may never be seen.
If your PHP code is constantly re-connecting to the database you've configured it wrong and aren't using connection pooling properly. The execution time of a query should be a few milliseconds at most if you've got your stack properly tuned. Do not cache unless you absolutely have to.
What you're advocating here is side-stepping a serious problem. Database queries should be effortless provided your database is properly configured. Fix that issue and you won't need to go down the caching road.
Saving data from one request to the other is a broken design and if not done perfectly could lead to embarrassing data bleed situations where one user is seeing content intended for another. This is why caching is an option usually pursued after all other avenues have been exhausted.

What are the number of ways in which my approach to a news-feed is wrong?

This question has been asked a THOUSAND times... so it's not unfair if you decide to skip reading/answering it, but I still thought people would like to see and comment on my approach...
I'm building a site which requires an activity feed, like FourSquare.
But my site has this feature for the eye-candy's sake, and doesn't need the stuff to be saved forever.
So, I write the event_type and user_id to a MySQL table. Before writing new events to the table, I delete all the older, unnecessary rows (by counting the total number of rows, finding the event_id below which everything is redundant, and deleting those rows). I prune the table, and write a new row every time an event happens. There's another user_text column which is NULL if there is no user-generated text...
In the front-end, I have jQuery that checks with a PHP file via GET every x seconds the user has the site open. The jQuery sends a request with the last update "id" it received. The <div> tags generated by my backend have the "id" attribute set as the MySQL row id. This way, I don't have to save the last_received_id in memory, though I guess there's absolutely no performance impact from storing one variable with a very small int value in memory...
I have a function that generates an "update text" depending on the event_type and user_id I pass it from the jQuery, and whether the user_text column is empty. The update text is passed back to jQuery, which appends the freshly received event <div> to the feed with some effects, while simultaneously getting rid of the "tail end" event <div> with an effect.
If I (more importantly, the client) want to, I can have an "event archive" table in my database (or a different one) that saves up all those redundant rows before deleting. This way, event information will be saved forever, while not impacting the performance of the live site...
I'm using CodeIgniter, so there's no question of repeated code anywhere. All the pertinent functions go into a LiveUpdates class in the library and model respectively.
I'm rather happy with the way I'm doing it because it solves the problem at hand while sticking to the KISS ideology... but still, can anyone please point me to some resources that show a better way to do it? A Google search on this subject reveals too many articles/SO questions, and I would like to benefit from the experience of any other developer who has already trawled through them and found the best approach...
If you use proper indexes there's no reason you couldn't keep all the events in one table without affecting performance.
If you craft your polling correctly to return nothing when there is nothing new you can minimize the load each client has on the server. If you also look into push notification (the hybrid delayed-connection-closing method) this will further help you scale big successfully.
Finally, it is completely unnecessary to worry about variable storage in the client. This is premature optimization. The performance issues are going to be in the avalanche of connections to the web server from many users, and in the DB, tables without proper indexes.
About indexes: An index is "proper" when the most common query against a table can be performed with a seek and a minimal number of reads (like 1-5). In your case, this could be an incrementing id or a date (if it has enough precision). If you design it right, the operation to find the most recent update_id should be a single read. Then when your client submits its ajax request to see if there is updated content, first do a query to see if the value submitted (id or time) is less than the current value. If so, respond immediately with the new content via a second query. Keeping the "ping" action as lightweight as possible is your goal, even if this incurs a slightly greater cost for when there is new content.
Using a push would be far better, though, so please explore Comet.
If you don't know how many reads are going on with your queries then I encourage you to explore this aspect of the database so you can find it out and assess it properly.
Update: offering the idea of clients getting a "yes there's new content" answer and then actually requesting the content was perhaps not the best. Please see Why the Fat Pings Win for some very interesting related material.

Caching table results for better performance... how?

First of all, the website I run is hosted and I don't have access to be able to install anything interesting like memcached.
I have several web pages displaying HTML tables. The data for these HTML tables are generated using expensive and complex MySQL queries. I've optimized the queries as far as I can, and put indexes in place to improve performance. The problem is if I have high traffic to my site the MySQL server gets hammered, and struggles.
Interestingly - the data within the MySQL tables doesn't change very often. In fact it changes only after a certain 'event' that takes place every few weeks.
So what I have done now is this:
Save the HTML table once generated to a file
When the URL is accessed check the saved file if it exists
If the file is older than 1hr, run the query and save a new file; if not, output the file
This ensures that for the vast majority of requests the page loads very fast, and the data can at most be 1hr old. For my purpose this isn't too bad.
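In sketch form (the cache path and the table-rendering function below are just placeholders), the approach is roughly:

<?php
// Serve the cached HTML if it is less than an hour old, otherwise regenerate it.
$cacheFile = dirname(__FILE__) . '/cache/report_table.html';
$maxAge    = 3600; // 1 hour

if (file_exists($cacheFile) && (time() - filemtime($cacheFile)) < $maxAge) {
    readfile($cacheFile);
    exit;
}

ob_start();
render_expensive_table();   // placeholder for the expensive queries + HTML output
$html = ob_get_clean();

file_put_contents($cacheFile, $html);
echo $html;
?>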
What I would really like is to guarantee that if any data changes in the database, the cache file is deleted. This could be done by finding all scripts that do any change queries on the table and adding code to remove the cache file, but it's flimsy as all future changes need to also take care of this mechanism.
Is there an elegant way to do this?
I don't have anything but vanilla PHP and MySQL (recent versions) - I'd like to play with memcached, but I can't.
Ok - serious answer.
If you have any sort of database abstraction layer (hopefully you will), you could maintain a field in the database for the last time anything was updated, and manage that from a single point in your abstraction layer.
e.g. (pseudocode): On any update set last_updated.value = Time.now()
Then compare this to the time of the cached file at runtime to see if you need to re-query.
If you don't have an abstraction layer, create a wrapper function to any SQL update call that does this, and always use the wrapper function for any future functionality.
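A minimal sketch of such a wrapper, assuming a small hand-rolled meta table (the names are mine) and the old mysql_* API:

<?php
// Wrapper for write queries that also records when the data last changed.
function run_update($sql) {
    $result = mysql_query($sql);
    if ($result) {
        // single-row meta table holding the time of the last data change
        mysql_query("UPDATE cache_meta SET last_updated = NOW() WHERE id = 1");
    }
    return $result;
}

// At render time: the cache file is stale if the data changed after it was written.
function cache_is_stale($cacheFile) {
    $row = mysql_fetch_assoc(mysql_query(
        "SELECT UNIX_TIMESTAMP(last_updated) AS ts FROM cache_meta WHERE id = 1"
    ));
    return !file_exists($cacheFile) || $row['ts'] > filemtime($cacheFile);
}
?>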
There are only two hard things in
Computer Science: cache invalidation
and naming things.
—Phil Karlton
Sorry, doesn't help much, but it is sooooo true.
You have most of the ends covered, but a last_modified field and cron job might help.
There's no way of deleting files from within MySQL; Postgres would give you that facility, but MySQL can't.
You can cache your output to a string using PHP's output buffering functions. Google it and you'll find a nice collection of websites explaining how this is done.
I'm wondering, however: how do you know that the data expires after an hour? Or are you assuming the data won't change dramatically enough in 60 minutes to warrant constant page generation?
