Wordpress is using MyISAM storage engine. MyISAM does not support transactions.
How wordpress is maintaining transactions?
I mean if wordpress is having two database write operations, how does it ensure atomicity?
Well, as far as I can tell, it doesn't! The only reason there are not much problems with this is, that most write operations are done with a single insert or update (adding a comment, creating a new post...).
In general, most web applications I have seen so far, don't bother too much with transactions, atomicity or even referential integrity, which is quite sad. On the one hand it is sad that so many applications just rely on pure luck that nothing bad happens and on the other hand it might lead to the impression that all these techniques aren't that important when it comes to database stuff.
I would think the transaction would assure atomic correctness at the previous layer of abstraction. When a transaction is occurring it default locks what it is writing. I'm not sure though.
Related
I have heard about this problem and now I am looking for more specific information?
How does it happens, what are the reasons for that, detailed explanation of the mechanism of the deadlock to try to avoid it. How to detect the deadlock, solve it and protect the data from being corrupted because of it. The case is when using MySQL with PHP.
And can I mix the InnoDB and MyISAM? I intend to use innoDB for some majo rtables with many relationships and not that much data, as users, roles, privileges, companies, etc. and use MyISAM for tables that hold more data: customers data, actions data, etc. I would like to use only InnoDB, but the move from MyISAM scares me a bit in terms of speed and stability. And now this deadlocks :(
Deadlocks can occur if you've got two or more independent queries accessing the same resources (tables/rows) at the same time. A real world example:
Two mechanics are working on two cars. At some point during the repair, they both need a screwdriver and a hammer to loosen some badly stuck part. Mechanic A grabs the screwdriver, Mechanic B grabs the hammer, and now neither can continue, as the second tool they need is not available: they're deadlocked.
Now, humans are smart and one of the mechanics will be gracious and hand over their tool to the other: both can continue working. Databases are somewhat stupid, and neither query will be gracious and unlock whatever resource is causing the deadlock. At this point, the DBMS will turn Rambo and force a roll back (or just kill) one or more of the mutually locked queries. That will let one lucky query continue and proceed to get the locks/transactions it needs, and hopefully the aborted ones have smart enough applications handling them which will restart the transactions again later. On older/simpler DBMSs, the whole system would grind to a halt until the DBA went in and did some manual cleanup.
There's plenty of methods for coping with deadlocks, and avoiding them in the first place. One big one is to never lock resources in "random" orders. In our mechanics' case, both should reach for the screwdriver first, before reaching for the hammer. That way one can successfully working immediately, while the other one will know he has to wait.
As for mixing InnodB/MyISAM - MySQL fully supports mixing/matching table types in queries. You can select/join/update/insert/delete/alter in any order you want, just remember that doing anything to a MyISAM table within an InnoDB transaction will NOT make MyISAM magically transaction aware. The MyISAM portions will execute/commit immediately, and if you roll back the InnoDB side of things, MyISAM will not roll back as well.
The only major reason to stick with MyISAM these days is its support for fulltext indexing. Other than that, InnoDB will generally be the better choice, as it's got full transaction support and row-level locking.
I've seen several database cache engines, all of them are pretty dumb (i.e.: keep this query cached for X minutes) and require that you manually delete the whole cache repository after a INSERT / UPDATE / DELETE query has been executed.
About 2 or 3 years ago I developed an alternative DB cache system for a project I was working on, the idea was basically to use regular expressions to find the table(s) involved in a particular SQL query:
$query_patterns = array
(
'INSERT' => '/INTO\s+(\w+)\s+/i',
'SELECT' => '/FROM\s+((?:[\w]|,\s*)+)(?:\s+(?:[LEFT|RIGHT|OUTER|INNER|NATURAL|CROSS]\s*)*JOIN\s+((?:[\w]|,\s*)+)\s*)*/i',
'UPDATE' => '/UPDATE\s+(\w+)\s+SET/i',
'DELETE' => '/FROM\s+((?:[\w]|,\s*)+)/i',
'REPLACE' => '/INTO\s+(\w+)\s+/i',
'TRUNCATE' => '/TRUNCATE\s+(\w+)/i',
'LOAD' => '/INTO\s+TABLE\s+(\w+)/i',
);
I know that these regexs probably have some flaws (my regex skills were pretty green back then) and obviously don't match nested queries, but since I never use them that isn't a problem for me.
Anyway, after finding the involved tables I would alphabetically sort them and create a new folder in the cache repository with the following naming convention:
+table_a+table_b+table_c+table_...+
In case of a SELECT query, I would fetch the results from the database, serialize() them and store them in the appropriate cache folder, so for instance the results of the following query:
SELECT `table_a`.`title`, `table_b`.`description` FROM `table_a`, `table_b` WHERE `table_a`.`id` <= 10 ORDER BY `table_a`.`id` ASC;
Would be stored in:
/cache/+table_a+table_b+/079138e64d88039ab9cb2eab3b6bdb7b.md5
The MD5 being the query itself. Upon a consequent SELECT query the results would be trivial to fetch.
In case of any other type of write query (INSERT, REPLACE, UPDATE, DELETE and so on) I would glob() all the folders that had +matched_table(s)+ in their name all delete all the file contents. This way it wouldn't be necessary to delete the whole cache, just the cache used by the affected and related tables.
The system worked pretty well and the difference of performance was visible - although the project had many more read queries than write queries. Since then I started using transactions, FK CASCADE UPDATES / DELETES and never had the time to perfect the system to make it work with these features.
I've used MySQL Query Cache in the past but I must say the performance doesn't even compare.
I'm wondering: am I the only one who sees beauty in this system? Is there any bottlenecks I may not be aware of? Why do popular frameworks like CodeIgniter and Kohana (I'm not aware of Zend Framework) have such rudimentary DB cache systems?
More importantly, do you see this as a feature worth pursuing? If yes, is there anything I could do / use to make it even faster (my main concerns are disk I/O and (de)serialization of query results)?
I appreciate all input, thanks.
I can see the beauty in this solution, however, I belive it only works for a very specific set of applications. Scenarios where it is not applicable include:
Databases which utilize cascading deletes/updates or any kind of triggers. E.g., your DELETE to table A may cause a DELETE from table B. The regex will never catch this.
Accessing the database from points which do not go through you cache invalidation scheme, e.g. crontab scripts etc. If you ever decide to implement replication across machines (introduce read-only slaves), it may also disturb the cache (because it does not go through cache invalidation etc.)
Even if these scenarios are not realistic for your case it does still answer the question of why frameworks do not implement this kind of cache.
Regarding if this is worth pursuing, it all depends on your application. Maybe you care to supply more information?
The solution, as you describe it, is at risk for concurrency issues. When you're receiving hundreds of queries per second, you're bound to hit a case where an UPDATE statement runs, but before you can clear your cache, a SELECT reads from it, and gets stale data. Additionally, you may run in to issues when several UPDATEs hit the same set of rows in a short time period.
In a broader sense, best practice with caching is to cache the largest objects possible. E.g., rather than having a bunch of "user"-related rows cached all over the place, it's better to just cache the "user" object itself.
Even better, if you can cache whole pages (e.g., you show the same homepage to everyone; a profile page appears identical to almost everyone, etc.), that's even better. One cache fetch for a whole, pre-rendered page will dramatically outperform dozens of cache fetches for row/query level caches followed by re-rending the page.
Long story short: profile. If you take the time to do some measurement, you'll likely find that caching large objects, or even pages, rather than small queries used to build those things, is a huge performance win.
While I do see the beauty in this - especially for environments where resources are limited and can not easily be extended, like on shared hosting - I personally would fear complications in the future: What if somebody, newly hired and unaware of the caching mechanism, starts using nested queries? What if some external service starts updating the table, with the cache not noticing?
For a specialized, defined project that urgently needs a speedup that cannot be helped by adding processor power or RAM, this looks like a great solution. As a general component, I find it too shaky, and would fear subtle problems in the long run that stem from people forgetting that there is a cache to be aware of.
I suspect that the regexes may not provide for every case - certainly they don't seem to deal with the scenario of mixing base table names and the tables themselves. e.g. consider
update stats.measures set amount=50 where id=1;
and
use stats;
update measures set amount=50 where id=1;
Then there's PL/SQL.
Then there's the fact that it depends on every client opting in to an advisory control mechanism i.e. it pre-supposes that all the database access is from machines implementing the caching control mechanism on a shared filesystem.
(as a small point - wouldn't it be simpler to just check the modification times on the data files to determine if the cached version of a query on a defined set of tables is still current, rather then trying to identify if the cache control mechanism has spotted an update - it would certainly be a lot more robust)
Stepping back a bit, implementing this from scratch using a robust architecture would mean that all queries would have to be intercepted by the control mechanism. The control mechanism would probably need a more sophisticated query parser. It certainly requires a common storgae substrate for all the instances of the control mechanism. It probably needs an understanding of the data dictionary - all things which are already implemented by the database itself.
You state that "I've used MySQL Query Cache in the past but I must say the performance doesn't even compare."
I find this rather odd. Certainly when dealing with large result sets from queries, my experience is that loading the data into the heap from a database is a lot faster than unserializing large arrays - although large result sets are rather atypical of web based applications.
When I've tried to speed up database access (after fixing everything else of course) then I've gone down the route of replicating and partitioning data across multiple DBMS instances.
C.
This is related to the problem of session splitting when working with multiple databases in a master-slave configuration. Basically, a similar set of regular expressions are used to determine which tables (or even which rows) are being read from or written to. The system keeps track of which tables were written to and when, and when a read to one of those tables comes up, it's routed to the master. If a query is reading from a table whose data needn't be up-to-the-second accurate, then it's routed to the slave. Generally, information only really needs to be current when it's something a user changed themselves (i.e., editing a user's profile).
They talk about this a good bit in the O'Reilly book High Performance MySQL. I used it quite a bit when developing a system for handling session splits back in the day.
The improvement you describe is to avoid invalidating caches that are guaranteed to not have been affected by an update because they draw data from a different table.
That is of course nice, but I am not sure if it is fine-grained enough to make a real difference. You would still be invaliding lots of caches that did not really need to be (because the update was on the table, but on different rows).
Also, even this "simple" scheme relies on being able to detect the relevant tables by looking at the SQL query string. This can be difficult to do in the general case, because of views, table aliases, and multiple catalogs.
It is very difficult to automatically (and efficiently) detect whether a cache needs to be invalidated. Because of that, you can either use a very simple scheme (such as invalidating on every update, or per table, as in your system, which does not work too well when there are many updates), or a very hand-crafted cache for the specific application with deep hooks into the query logic (probably difficult to write and hard to maintain), or accept that the cache can contain stale data and just refresh it periodically.
On a site with a reasonable amount of traffic , would it matter if the application/business logic is written as stored procedures ,triggers and views , instead of inside the PHP code itself?
What would be the best way to go keeping scalability in mind.
I can't provide you statistics, but unless you plan to change PHP for another language in the future, i can say keeping the business logic in PHP is more "scalability friendly".
Its always easier and cheaper to solve web server load problems than having them in the database. Your database will always need to be lighting quick and just throwing mirrors at it won't solve the problem. The more database slaves you have, the more writes you have to do.
In my experience, you should put business logic in PHP code rather than move it onto the database. Assuming your database is on a separate server, you don't want your database to be busy calculating formulas when requests come in.
Keep your database lightning fast to handle selects, inserts and updates.
I think you will have far better scalibility keeping database code in the database where it can be performance tuned as the number of records gets larger. You will also have better data integrity which is critical to the data even being useful. You don't see a lot of terrabyte sized relational dbs with all their code in the application.
Read some books on database performance tuning and then decide if you want to risk your company's data on application code.
There are several things to consider when trying to decide whether to place the business logic in the database or in the application code.
Will the same database be accessed
from different websites / web
applications? Will the sites /
applications be written in the same
language or in a different language?
If the database will be used from a single site, and the site is written in a single language then this becomes a non-issue. Otherwise, you'll need to consider the added complexity of stored procedures, triggers, etc vs trying to maintain database access logic etc in multiple code bases.
What are relational databases in
general good for and what is MySQL
good for specifically? What is PHP
best at?
This consideration is fairly straight-forward. Relational databases across the board and specifically in any variant of SQL are going to do a great job at inserting, updating, and deleting data. Generally they also handle ATOMIC transactions well. However, most variants of SQL (including MySQL) are not good at complex calculations, on-the-fly date handling, file system access etc.
PHP on the other hand is very fast at handling calculations, dates, file system accesses. By taking a little time you can even design your PHP code to work in such a way that records are only retrieved once and then stored when necessary.
What are you most familiar /
comfortable with using?
Obviously it tends to make more sense to use the tool with which you are most familiar.
As a last point consider that just because a drill can be used to cut sheet rock or because a hammer can be used to drive a screw doesn't mean that they should be used for these things. Sometimes I think that programmers do more potential damage by trying to make more powerful tools that do everything rather than making simpler tools that do one thing really, really well.
A well done PHP application should be enought, but keep in mind that it also requires you to do the less calls to the database you can. Store values you'll need later in PHP, shorten queries, cache, etc.
MySQL optimization is always a must, as it will also decrease the amount of databse calls by PHP, and thus getting a better performance. Therefore, there's no way you can't think of stored procedures, etc, if your aim is to increase performance. But MySQL by itself would't be enought if your PHP code isn't well done (lots of unecessary database calls), that's why I think PHP must be well coded, keeping in mind the hole process while developing it, so that unecessary stuff doesn't get in the way. Cache for instance, in "duet" with proper MySQL, is a great boost on performance.
My POV, even not having much experience in developing large applications is to write business logic in the DB for some reasons:
1 - Maintainability, I think that languages deprecate functions and changes many other things in a short time period, so if PHP changes version, you'll need to adapt your code to the new version
2 - DBs tends to be more language stable, so when a new version of a RDBMS comes out, it usually doesn't change many things in the way you write your queries or SPs, or it even doesn't change. Writing your logic in DB will reduce code adaptation because of a new DB version
3 - A RDBMS is more likely to be alive for a long period rather than a programming language. Also, as your data is critical, there is a big worry from the RDBMS developers for automatic migration of your whole data to the new RDBMS version, including your SPs. When clipper died, there were no ways to migrate systems to a new programming language, they had to be completely rewritten.
4 - If you think someday to change completely the language you are writing the application for some reason(language death, for example), the only thing to be rewritten will be the presentation and the SP calls, not business logic.
I'd like to know from other people here if what I pointed out makes sense, and if not, why. I'm on the same situation as Sabeen Malik, I'm thinking to begin my first huge project and I'm tending towards SPs because of what I wrote. So it's time to correct my POV if it's not so correct.
MySQL sucks at using advanced DB techniques, it's simple and fast. PHP, being a dynamic language, makes processing data very easy. Therefore, it usually makes sense to use PHP.
I am currently working on a very specialized PHP framework, which might have to handle large database transfers.
For example:
Take half of the whole user count; this should be every day's workspace for the framework.
So, if my framework is required by big projects, is it recommend to use single transactions with multiple queries (e.g. doing many things in 1 query with JOINs(?)), or is auto-commit preferred?
If possible, post some blog entries which have discussed this problem.
Thank you. ;)
MyISAM is transactionless, so autocommit does not affect it.
As for InnoDB, autocommit makes repeating queries hundreds of times as slow.
The best decision is of course doing everything set-based, but if you have to do queries in a loop, turn autocommit off.
Also note that "multiple queries" and "doing many things in one query with JOIN" are completely different things.
The latter is an atomic operation which succeeds or fails at once even on MyISAM.
Autocommit does not affect its performance as such: everything inside this query is always done in a single transaction.
Doing many things in one query is actually a preferred way to do many things.
Joins are not just useful they are necessary in any but the simplest of datastructures. To consider writing without them shows a fundamental lack of understanding of relational database design and access to me. Not to use joins would usually result in getting the wrong answer to your question in a select. You may get away without joins in inserts, updates and deletes, but they are often used and useful there as well.
I'm looking to scale an existing phpBB installation by separating the read queries from the write queries to two separate, replicated MySQL servers. Anyone succeeded in doing this, specifically with phpBB?
The biggest concern I have so far is that it seems like the queries are scattered haphazardly throughout the code. I'd love to hear if anyone else did this, and if so, how it went / what was the process.
You could try MySQL Proxy which would be an easy way to split the queries without changing the application.
Just add more RAM. Enough RAM to hold the entire database. You'll be surprised how fast your inefficient script will fly. Memory forgives a lot of database scaling mistakes.
I know this was asked a long time ago, but I'd like to share what I experienced, in case it can help anyone.
If your problem are table locks, and knowing that the default storage engine of phpbb in that day was MyISAM, have you looked at moving to InnoDB storage engine?
Just find out which tables are most frequently locked, and convert those to InnoDB. The sessions table is the first candidate here, although you may want to look at other optimizations (such as storing session data only in memcache or something) if that is your main bottleneck.