I'm on an optimization crusade for one of my sites, trying to cut down as many MySQL queries as I can.
I'm implementing partial caching, which writes .txt files for various modules of the site and updates them on demand. I've come across one that cannot remain static for all users, so the .txt file that's written to the hard drive will need to be altered on the fly via PHP.
Which is done via
flush();
ob_start();
include('file.txt');
$contents = ob_get_clean();
Then I modify the html in the $contents variable, and echo it out for different users.
Alternatively, I can leave it as it is, which runs a mysql query, which queries a small table that has category names (about 13 of them).
Which one is less expensive? Running a query every single time.... or doing it via the method I posted above, to inject html code on the fly, into a static .txt file?
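For illustration, here is a minimal sketch of the file-based variant. It assumes the cached fragment carries a placeholder token where the per-user HTML goes; the `{{USER_NAME}}` token and the `renderCachedFragment` helper are made up for this example. It also swaps include() plus output buffering for file_get_contents(), which should be enough as long as the .txt file contains no PHP to execute.

```php
<?php
// Hypothetical helper: load a cached HTML fragment and fill in per-user values.
function renderCachedFragment(string $path, array $replacements): string
{
    $contents = file_get_contents($path);
    if ($contents === false) {
        throw new RuntimeException("Cannot read cache file: $path");
    }
    // strtr() swaps every placeholder key for its value in a single pass.
    return strtr($contents, $replacements);
}

// Demo: a temporary file stands in for the cached module on disk.
$cacheFile = tempnam(sys_get_temp_dir(), 'frag');
file_put_contents($cacheFile, '<p>Hello, {{USER_NAME}}!</p>');

$html = renderCachedFragment($cacheFile, [
    '{{USER_NAME}}' => htmlspecialchars('Alice', ENT_QUOTES, 'UTF-8'),
]);
echo $html, PHP_EOL;
unlink($cacheFile);
```

Whether this beats the 13-row query is something only a measurement on the real server can show.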
Reading the file (save in very weird setups) will be minutely faster than querying the DB (no network interaction, &c), but the difference will hardly be measurable -- just try and see if you can measure it!
Optimize your queries first! Then use memcache or a similar caching system for data that is accessed frequently, and then you can add file caching. We use all three combined and it runs very smoothly. Small optimized queries aren't so bad. If your DB is on the local server, the network is not an issue. And don't forget to use the MySQL query cache (I guess you do use MySQL).
Where is your performance bottleneck?
If you don't know the bottleneck, you can't make any sensible assessment about optimisations.
Collect some metrics, and optimise accordingly.
Try both and choose the one that either is a clear winner or if not available, more maintainable. This depends on where the DB is, how much load it's getting, and whether you'll need to run more than one application instance (then they'd need to share this file on the network and it's not local anymore).
Here are the patterns that work for me when I'm refactoring PHP/MySQL site code.
The number of queries per page is absolutely critical - one complex query with joins is fastest as long as indexes are proper. A single page can almost always be generated with five or fewer queries in my experience, plus good use of classes and arrays of classes. Often one query for the session and one query for the app.
After indexes the biggest thing to work on is the caching configuration parameters.
Never have queries in loops.
Moving database queries to files has never been a useful strategy, especially since it often ends up screwing up your query integrity.
Alex and the others are right about testing. If your pages are noticeably slow, then they are slow for a reason (or reasons) - don't even start changing anything until you know what the reasons are and can measure the consequences of your changes. Refactoring by guessing is always a losing strategy, especially when (as in your case) you're adding complexity.
Related
At the moment I am writing a series of functions for fetching Dota 2 matches from the Steam API. When someone fetches their games, I have to (for my use) take a history of all of their games (let's say 3 API calls), then all the details from each of those games (so if there are 200 games, another 200 API calls). This takes a long time, and so far I'm programming all of the above to be in one PHP file, "FetchMatchHistory.php", which is run by the user clicking a button on the web page.
Another thing that is making me feel it should be in one file, is that I imagine it is probably good practice to put all of the information (In this case, match history, match details, id's etc.) into the database all at once, so that there doesn't have to be null values in the database?
My question is whether a function that takes a very long time should live in just one PHP file (should meaning: is it generally considered good practice), or whether I should break the separate functions down into smaller files. This is very context dependent, I know, so please forgive me.
Is it common to have API calls spanning several PHP files if that is what you are making? Is there a security/reliability issue with having only one file doing all the leg-work (so to speak)?
Good practice is to group a number of related functions together in a PHP file named for what they do, both to organize them better and for caching reasons, since some parts get updated less often than others.
But speaking of performance, I doubt you'll get the performance improvements you seek just by moving code between files.
Personally, I had the habit of putting everything in one file, which consistently ended up:
making my files fat
hard to update
hard to read
hard to find the thing I want (Ctrl+F meltdown)
wasting bandwidth uploading parts that did not need to be updated
virtually disabling caching on the server
I don't know if any of the above is of any use for your app, but breaking files into their relevant files/places made my life easier.
UPDATE:
About the database practice, you're going to query only the parts you want to be updated.
I don't understand why you would split that logic across files; that is not going to give you performance. Instead, what will give you performance is updating only the relevant parts and having tables with relevant content. Using multiple tables makes a lot more sense, since you can use them as pointers to the large data contained in other tables, reducing the data that would be wasted by having just one table.
Also, don't forget that a single table has limitations; I personally try to have as few columns as possible. If you keep adding more, one day you won't be able to add another because of the row size limit. There is a maximum number of columns in general, but developers rarely reach it; the growing per-row content is what exhausts the row size limit first.
Whether to split server side code to multiple files or keep it in a single one is an organizational issue, more than a security/reliability one...
I don't think it's more secure to keep your code in separate source files.
It's entirely a matter of how you prefer to organize and maintain your code base.
Usually, I separate it when I can find some kind of "categories" in my code.
Obviously, if you write OO code, the most common choice is to keep each class in a single file...
I'm the webmaster for a major US university. We have a great deal of requests on our website, which I've built and been in charge of for the last 7 years or so. I've been building ever-more-complex features into our website and it's always been my practice to put as much of the programming burden on our multi-processor Microsoft SQL server as possible - using stored procedures, views, etc, and fill-in what can't be done with PHP, ASP, or Perl from the IIS web server. Both servers are very powerful and capable machines. Since I've been doing this alone for so long without anyone else to brainstorm with, I'm curious if my approach is ideal for even higher load situations we'll have in the future.
My question is: Is it better practice to place more of the load burden on the SQL server using nested SELECT statements, views, stored procedures and aggregate functions, or should I be pulling multiple simpler queries and processing through them using server-side compile-time scripts like PHP? Keep on keepin' on or come up with a better way?
I've recently become more interested in performance after I did some load traces and learned just how much I've been putting on the shoulders of the SQL server. Both the web server and SQL servers are fast and responsive throughout the day, almost without regard for how much I put on them, but I'd like to be ready, having trained myself and upgraded my existing code with optimized best practices in mind, by the time it becomes important.
Thanks for your advice and input.
You put each layer in your stack to use in the domain it fits best.
There is no use in having your database server send 1000 rows and using PHP to filter them if a WHERE-clause or GROUP-clause would suffice. It's not optimal to call the database to add two integers (SELECT 5+9 works fine, but php can do it itself, and you save the roundtrip).
You will probably want to look into scalability: what parts of your application can be divided into multiple processes? If you're still just using 2 layers (script & db), there is a lot of room for scaling there. But always start with the bottleneck first.
Some examples: host static contents on CDN, use caching for your pages, read about nginx and memcached, use nosql (mongoDB), consider sharding, consider replication.
My opinion is that it's generally (mostly) best to favor letting the web servers do the processing. Two points:
First is scalability. Once your application gets enough usage, you'll need to start worrying about load balancing. And it's a lot easier to drop in a couple of extra web servers pointing to a common database than it is to set up a distributed database cluster. So best to take as much strain away from the Database as you can and keep it on a single machine for as long as possible.
The second point I'd like to make is about optimizing the queries. This will depend a lot on the queries you are using, and the database backend. When I first started working with databases, I fell into the trap of making elaborate SQL queries with multiple JOINs that fetched exactly the data I wanted, even if it was from four or five different tables. I reasoned that "That's what the database is there for - let's get it to do the hard work".
I quickly found that these queries took way too long to execute, and often ended up blocking the database from other requests. While it may seem inefficient to split your query into multiple requests (for example in a for loop), you'll often find that executing multiple small queries with fast indexes will make your application run far more smoothly than trying to pass all the hard work to the database.
Firstly, you might want to check if there is any load which can be removed entirely by client side caching (.js, .css, static HTML and images), and use of technologies such as AJAX to do partial updates of screens - this will remove load on both web and sql servers.
Secondly, see if there is sql load which can be reduced by web server caching - e.g. static or low refresh data - if you have a lot of 'content' pages on your systems, have a look at common CMS caching techniques which will scale to allow many more users to view the same data without rebuilding the page or hitting the database.
I tend to do as much as possible outside the db, viewing db calls as expensive/time-intensive.
For example, when performing a select on a user table with fields name_given and name_family, I could fatten the query to return a column called full_name built by concatenation. But that kind of thing can be easily done in a model on your server-side scripting language (PHP, Ruby, etc).
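As a rough sketch of that idea, the concatenation can live in a small model class instead of the query. The `User` class and its `fullName()` method are hypothetical; only the `name_given` and `name_family` column names come from the example above.

```php
<?php
// Hypothetical model built from a row with name_given and name_family columns.
final class User
{
    private string $nameGiven;
    private string $nameFamily;

    public function __construct(array $row)
    {
        $this->nameGiven  = $row['name_given'];
        $this->nameFamily = $row['name_family'];
    }

    // Derived value computed in PHP rather than by the database.
    public function fullName(): string
    {
        return trim($this->nameGiven . ' ' . $this->nameFamily);
    }
}

// The array stands in for a row fetched from the user table.
$user = new User(['name_given' => 'Ada', 'name_family' => 'Lovelace']);
echo $user->fullName(), PHP_EOL;
```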
Of course, there are cases when the db is the more "natural" place to perform an operation. But, in general, I incline more towards putting the load on the web server and optimize there with many of the techniques noted in other answers.
I have a somewhat theoretical question: I'm designing my own CMS/app framework (as many PHP programmers at various levels did before... and always will) to either make a production-ready solution or develop various modules/plugins that I'll use later.
Anyway, I'm thinking of gathering the SQL queries from the whole app and then running them in one place:
index.php:
<?php
include ('latestposts.php');
include ('sidebar.php');
?>
latestposts.php:
<?php
function gather_data ($arg){ $sql= ""; }
function draw ($data) {...}
?>
sidebar.php:
<?php
function gather_data ($arg){ $sql= ""; }
function draw ($data) {...}
?>
Now, while the whole module system is yet to be figured out, the idea is already floating somewhere in my brain. What I'm wondering is whether I can first load all the gather_data functions, then run the SQL, and then run the draw functions - and whether I can reuse results!
If, for example, $sql is SELECT * FROM POSTS LIMIT 10 and $sql2 is SELECT * FROM POSTS LIMIT 5, is it possible to program PHP to see: "ah, it's the same SQL, I'll call it just once and reuse the first 5 rows"?
Or is it possible to add this behavior to some ORM?
However, as the tags say, this is still just an idea in progress. If it proves to be easy to accomplish, then I will post more questions on how :)
So, basically: Is it possible, does it make sense? If both are yes, then... any ideas how?
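One way the "run it once and reuse it" part could look is a per-request memoizing wrapper keyed by the SQL text. This is only a sketch: the `RequestQueryCache` class is hypothetical, and a closure returning fake rows stands in for the real database call.

```php
<?php
// Hypothetical per-request cache: identical SQL strings hit the database only once.
final class RequestQueryCache
{
    /** @var callable function (string $sql): array */
    private $runQuery;
    /** @var array<string, array> results keyed by a hash of the SQL text */
    private array $results = [];
    public int $misses = 0;

    public function __construct(callable $runQuery)
    {
        $this->runQuery = $runQuery;
    }

    public function fetch(string $sql): array
    {
        $key = md5($sql);
        if (!array_key_exists($key, $this->results)) {
            $this->misses++;
            $this->results[$key] = ($this->runQuery)($sql);
        }
        return $this->results[$key];
    }
}

// Stand-in for a real database call; always returns ten fake rows.
$fakeDb = function (string $sql): array {
    return array_map(fn (int $id) => ['id' => $id], range(1, 10));
};

$cache  = new RequestQueryCache($fakeDb);
$latest = $cache->fetch('SELECT * FROM POSTS LIMIT 10'); // runs the query
$again  = $cache->fetch('SELECT * FROM POSTS LIMIT 10'); // served from memory
// The LIMIT 5 case is handled by hand here: slice the ten cached rows.
$sidebar = array_slice($latest, 0, 5);
echo $cache->misses, ' query run, ', count($sidebar), ' sidebar rows', PHP_EOL;
```

Note that the sketch only recognizes byte-identical SQL. Spotting that LIMIT 5 is a prefix of LIMIT 10 automatically would need query parsing, and it only holds when both queries return rows in the same order.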
Don't get me wrong, that sounds like a plausible idea and you can probably get it running. But I wonder if it is really going to be beneficial. Will it cause a system to be faster? Give you more control? Make development easier?
I would just look into using (or building) a system using well practiced MVC style coding standards, build a good DB structure, and tweak the heck out of Apache (or use something like Lighttpd). You will have a lot more widespread acceptance of your code if you ever decide to make it open source, and if you ever need a hand with it another developer could step right in and pick up the keyboard.
Also, check out query caching in MySQL--you will see a similar (though not one-to-one) benefit from caching your query results server side with regard to your query example. Even better that is stored in server memory so PHP/MySQL overhead is dropped AND you don't have to code it.
All of that aside, I do think it is possible. =)
Generally speaking, such a cache system can generate significant time savings, but at the cost of memory and complexity. The more results you want to keep, the more memory it will take; and there's no guarantee that your results will ever be used again, particularly the larger result sets.
Second, there are certain queries that should never be cached, or that should be run again even if they're in the cache. For the most part, only SELECT and SHOW queries can be cached effectively, but you need to worry about invalidating them when you modify the underlying data. Even in the same pageview, you might find yourself working around your own cache system on occasion.
Third, this kind of problem has already been solved several times. First, consider turning on the MySQL query cache. Most of the time, it will speed things up a bit without requiring any code changes on your end. However, it's a bit aggressive about invalidating entries, so you could gain some performance at a higher level.
If you need another level, consider memcached. You'll have to store and invalidate entries manually, but it can store results across page views (where you'll really find the performance benefit), and will let unused entries expire before running out of memory.
I've seen several database cache engines, all of them are pretty dumb (i.e.: keep this query cached for X minutes) and require that you manually delete the whole cache repository after an INSERT / UPDATE / DELETE query has been executed.
About 2 or 3 years ago I developed an alternative DB cache system for a project I was working on, the idea was basically to use regular expressions to find the table(s) involved in a particular SQL query:
$query_patterns = array
(
'INSERT' => '/INTO\s+(\w+)\s+/i',
'SELECT' => '/FROM\s+((?:[\w]|,\s*)+)(?:\s+(?:(?:LEFT|RIGHT|OUTER|INNER|NATURAL|CROSS)\s*)*JOIN\s+((?:[\w]|,\s*)+)\s*)*/i',
'UPDATE' => '/UPDATE\s+(\w+)\s+SET/i',
'DELETE' => '/FROM\s+((?:[\w]|,\s*)+)/i',
'REPLACE' => '/INTO\s+(\w+)\s+/i',
'TRUNCATE' => '/TRUNCATE\s+(\w+)/i',
'LOAD' => '/INTO\s+TABLE\s+(\w+)/i',
);
I know that these regexes probably have some flaws (my regex skills were pretty green back then) and obviously don't match nested queries, but since I never use them that isn't a problem for me.
Anyway, after finding the involved tables I would alphabetically sort them and create a new folder in the cache repository with the following naming convention:
+table_a+table_b+table_c+table_...+
In case of a SELECT query, I would fetch the results from the database, serialize() them and store them in the appropriate cache folder, so for instance the results of the following query:
SELECT `table_a`.`title`, `table_b`.`description` FROM `table_a`, `table_b` WHERE `table_a`.`id` <= 10 ORDER BY `table_a`.`id` ASC;
Would be stored in:
/cache/+table_a+table_b+/079138e64d88039ab9cb2eab3b6bdb7b.md5
The file name being the MD5 hash of the query itself. Upon a subsequent SELECT query the results would be trivial to fetch.
In case of any other type of write query (INSERT, REPLACE, UPDATE, DELETE and so on) I would glob() all the folders that had +matched_table(s)+ in their name and delete all the file contents. This way it wouldn't be necessary to delete the whole cache, just the cache used by the affected and related tables.
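The naming and invalidation rules described above can be sketched roughly as follows. The function names are hypothetical and this is not the original implementation, just the folder-naming convention and the "does this folder mention the written table" test expressed as plain string functions.

```php
<?php
// Build the "+table_a+table_b+" folder name from the tables a query touches.
function cacheFolderName(array $tables): string
{
    $tables = array_unique($tables);
    sort($tables, SORT_STRING); // alphabetical, so table order in the query is irrelevant
    return '+' . implode('+', $tables) . '+';
}

// Cache file for a SELECT: one folder per table set, one file per query hash.
function cacheFilePath(string $root, array $tables, string $sql): string
{
    return rtrim($root, '/') . '/' . cacheFolderName($tables) . '/' . md5($sql) . '.md5';
}

// A write to $table invalidates every folder whose name contains "+table+".
function folderIsAffected(string $folderName, string $table): bool
{
    return strpos($folderName, '+' . $table . '+') !== false;
}

echo cacheFolderName(['table_b', 'table_a']), PHP_EOL;
var_dump(folderIsAffected('+table_a+table_b+', 'table_b'));
var_dump(folderIsAffected('+table_a+table_b+', 'table'));
```

Wrapping each name in `+` on both sides is what keeps a write to `table` from matching `table_a`. With glob(), the equivalent lookup would be a pattern along the lines of `/cache/*+table_b+*`.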
The system worked pretty well and the difference of performance was visible - although the project had many more read queries than write queries. Since then I started using transactions, FK CASCADE UPDATES / DELETES and never had the time to perfect the system to make it work with these features.
I've used MySQL Query Cache in the past but I must say the performance doesn't even compare.
I'm wondering: am I the only one who sees beauty in this system? Are there any bottlenecks I may not be aware of? Why do popular frameworks like CodeIgniter and Kohana (I'm not aware of Zend Framework) have such rudimentary DB cache systems?
More importantly, do you see this as a feature worth pursuing? If yes, is there anything I could do / use to make it even faster (my main concerns are disk I/O and (de)serialization of query results)?
I appreciate all input, thanks.
I can see the beauty in this solution; however, I believe it only works for a very specific set of applications. Scenarios where it is not applicable include:
Databases which utilize cascading deletes/updates or any kind of triggers. E.g., your DELETE to table A may cause a DELETE from table B. The regex will never catch this.
Accessing the database from points which do not go through your cache invalidation scheme, e.g. crontab scripts etc. If you ever decide to implement replication across machines (introduce read-only slaves), it may also disturb the cache (because it does not go through cache invalidation etc.)
Even if these scenarios are not realistic for your case it does still answer the question of why frameworks do not implement this kind of cache.
Regarding if this is worth pursuing, it all depends on your application. Maybe you care to supply more information?
The solution, as you describe it, is at risk for concurrency issues. When you're receiving hundreds of queries per second, you're bound to hit a case where an UPDATE statement runs, but before you can clear your cache, a SELECT reads from it, and gets stale data. Additionally, you may run in to issues when several UPDATEs hit the same set of rows in a short time period.
In a broader sense, best practice with caching is to cache the largest objects possible. E.g., rather than having a bunch of "user"-related rows cached all over the place, it's better to just cache the "user" object itself.
Better still, if you can cache whole pages (e.g., you show the same homepage to everyone; a profile page appears identical to almost everyone, etc.), do that. One cache fetch for a whole, pre-rendered page will dramatically outperform dozens of cache fetches for row/query level caches followed by re-rendering the page.
Long story short: profile. If you take the time to do some measurement, you'll likely find that caching large objects, or even pages, rather than small queries used to build those things, is a huge performance win.
While I do see the beauty in this - especially for environments where resources are limited and can not easily be extended, like on shared hosting - I personally would fear complications in the future: What if somebody, newly hired and unaware of the caching mechanism, starts using nested queries? What if some external service starts updating the table, with the cache not noticing?
For a specialized, defined project that urgently needs a speedup that cannot be helped by adding processor power or RAM, this looks like a great solution. As a general component, I find it too shaky, and would fear subtle problems in the long run that stem from people forgetting that there is a cache to be aware of.
I suspect that the regexes may not provide for every case - certainly they don't seem to deal with the scenario of mixing database-qualified table names and bare table names, e.g. consider
update stats.measures set amount=50 where id=1;
and
use stats;
update measures set amount=50 where id=1;
Then there's PL/SQL.
Then there's the fact that it depends on every client opting in to an advisory control mechanism i.e. it pre-supposes that all the database access is from machines implementing the caching control mechanism on a shared filesystem.
(as a small point - wouldn't it be simpler to just check the modification times on the data files to determine if the cached version of a query on a defined set of tables is still current, rather than trying to identify if the cache control mechanism has spotted an update - it would certainly be a lot more robust)
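A minimal sketch of that modification-time check, assuming each table maps to its own data file on a filesystem the web server can read (which is not true of every storage engine or deployment). The `cacheIsFresh` function is hypothetical and temporary files stand in for the real data and cache files.

```php
<?php
// Hypothetical check: the cache is fresh if it is at least as new as every data file.
function cacheIsFresh(string $cacheFile, array $dataFiles): bool
{
    clearstatcache(); // avoid stale mtime values cached by PHP
    if (!is_file($cacheFile)) {
        return false;
    }
    $cachedAt = filemtime($cacheFile);
    foreach ($dataFiles as $dataFile) {
        if (!is_file($dataFile) || filemtime($dataFile) > $cachedAt) {
            return false;
        }
    }
    return true;
}

// Demo: temporary files stand in for a table's data file and a cached result.
$dataFile  = tempnam(sys_get_temp_dir(), 'tbl');
$cacheFile = tempnam(sys_get_temp_dir(), 'qry');
touch($dataFile, time() - 60);  // table last written a minute ago
touch($cacheFile, time() - 30); // cache written after that
$before = cacheIsFresh($cacheFile, [$dataFile]);
touch($dataFile, time());       // simulate a write to the table
$after = cacheIsFresh($cacheFile, [$dataFile]);
var_dump($before, $after);
unlink($dataFile);
unlink($cacheFile);
```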
Stepping back a bit, implementing this from scratch using a robust architecture would mean that all queries would have to be intercepted by the control mechanism. The control mechanism would probably need a more sophisticated query parser. It certainly requires a common storage substrate for all the instances of the control mechanism. It probably needs an understanding of the data dictionary - all things which are already implemented by the database itself.
You state that "I've used MySQL Query Cache in the past but I must say the performance doesn't even compare."
I find this rather odd. Certainly when dealing with large result sets from queries, my experience is that loading the data into the heap from a database is a lot faster than unserializing large arrays - although large result sets are rather atypical of web based applications.
When I've tried to speed up database access (after fixing everything else of course) then I've gone down the route of replicating and partitioning data across multiple DBMS instances.
C.
This is related to the problem of session splitting when working with multiple databases in a master-slave configuration. Basically, a similar set of regular expressions are used to determine which tables (or even which rows) are being read from or written to. The system keeps track of which tables were written to and when, and when a read to one of those tables comes up, it's routed to the master. If a query is reading from a table whose data needn't be up-to-the-second accurate, then it's routed to the slave. Generally, information only really needs to be current when it's something a user changed themselves (i.e., editing a user's profile).
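The routing rule described here could be sketched like this. The `ReadWriteRouter` class and its `$stickySeconds` window (how long after a write reads stay on the master) are assumptions made for the example; table detection is left out and table names are passed in directly.

```php
<?php
// Hypothetical router: reads of recently written tables go to the master.
final class ReadWriteRouter
{
    /** @var array<string, int> table name => timestamp of the last write */
    private array $lastWrite = [];
    private int $stickySeconds;

    public function __construct(int $stickySeconds = 5)
    {
        $this->stickySeconds = $stickySeconds;
    }

    public function recordWrite(string $table, int $now): void
    {
        $this->lastWrite[$table] = $now;
    }

    /** Returns 'master' or 'slave' for a read touching the given tables. */
    public function routeRead(array $tables, int $now): string
    {
        foreach ($tables as $table) {
            $writtenAt = $this->lastWrite[$table] ?? null;
            if ($writtenAt !== null && $now - $writtenAt < $this->stickySeconds) {
                return 'master';
            }
        }
        return 'slave';
    }
}

$router = new ReadWriteRouter(5);
$router->recordWrite('users', 100);
echo $router->routeRead(['users'], 102), PHP_EOL;    // written 2s ago
echo $router->routeRead(['articles'], 102), PHP_EOL; // never written
echo $router->routeRead(['users'], 110), PHP_EOL;    // window has elapsed
```

Timestamps are passed in explicitly so the behaviour is easy to test; a real version would more likely track writes per session and use the current time.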
They talk about this a good bit in the O'Reilly book High Performance MySQL. I used it quite a bit when developing a system for handling session splits back in the day.
The improvement you describe is to avoid invalidating caches that are guaranteed to not have been affected by an update because they draw data from a different table.
That is of course nice, but I am not sure if it is fine-grained enough to make a real difference. You would still be invalidating lots of caches that did not really need to be (because the update was on the table, but on different rows).
Also, even this "simple" scheme relies on being able to detect the relevant tables by looking at the SQL query string. This can be difficult to do in the general case, because of views, table aliases, and multiple catalogs.
It is very difficult to automatically (and efficiently) detect whether a cache needs to be invalidated. Because of that, you can either use a very simple scheme (such as invalidating on every update, or per table, as in your system, which does not work too well when there are many updates), or a very hand-crafted cache for the specific application with deep hooks into the query logic (probably difficult to write and hard to maintain), or accept that the cache can contain stale data and just refresh it periodically.
Okay, so I'm sure plenty of you have built crazy database intensive pages...
I am building a page that I'd like to pull all sorts of unrelated database information from. Here are some sample different queries for this one page:
article content and info
IF the author is a registered user, their info
UPDATE the article's view counter
retrieve comments on the article
retrieve information for the authors of the comments
if the reader of the article is signed in, query for info on them
etc...
I know these are basically going to be pretty lightning quick, and that I could combine some; but I wanted to make sure that this isn't abnormal?
How many fairly normal and un-heavy queries would you limit yourself to on a page?
As many as needed, but not more.
Really: don't worry about optimization (right now). Build it first, measure performance second, and IFF there is a performance problem somewhere, then start with optimization.
Otherwise, you risk spending a lot of time on optimizing something that doesn't need optimization.
I've had pages with 50 queries on them without a problem. A fast query to a non-large (ie, fits in main memory) table can happen in 1 millisecond or less, so you can do quite a few of those.
If a page loads in less than 200 ms, you will have a snappy site. A big chunk of that is being used by latency between your server and the browser, so I like to aim for < 100ms of time spent on the server. Do as many queries as you want in that time period.
The big bottleneck is probably going to be the amount of time you have to spend on the project, so optimize for that first :) Optimize the code later, if you have to. That being said, if you are going to write any code related to this problem, write something that makes it obvious how long your queries are taking. That way you can at least find out you have a problem.
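As a sketch of "something that makes it obvious how long your queries are taking", a small wrapper can time each call and keep a log. The `QueryTimer` class is hypothetical, and usleep() stands in for a real database call.

```php
<?php
// Hypothetical wrapper that records how long each query callable takes.
final class QueryTimer
{
    /** @var array<int, array{label: string, ms: float}> */
    private array $log = [];

    public function time(string $label, callable $query)
    {
        $start  = hrtime(true); // monotonic clock, in nanoseconds
        $result = $query();
        $this->log[] = [
            'label' => $label,
            'ms'    => (hrtime(true) - $start) / 1e6,
        ];
        return $result;
    }

    public function totalMs(): float
    {
        return array_sum(array_column($this->log, 'ms'));
    }

    public function report(): array
    {
        return $this->log;
    }
}

$timer = new QueryTimer();
// usleep() stands in for the time a real query would take.
$rows = $timer->time('load comments', function () {
    usleep(2000);
    return [['id' => 1], ['id' => 2]];
});
foreach ($timer->report() as $entry) {
    printf("%s: %.2f ms\n", $entry['label'], $entry['ms']);
}
```

Printing `totalMs()` in a debug footer is one simple way to compare a page against a budget such as the 100 ms mentioned above.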
I don't think there is any one correct answer to this. I'd say as long as the queries are fast, and the page follows a logical flow, there shouldn't be any arbitrary cap imposed on them. I've seen pages fly with a dozen queries, and I've seen them crawl with one.
Every query requires a round-trip to your database server, so the cost of many queries grows larger with the latency to it.
If it runs on the same host there will still be a slight speed penalty, not only because a socket sits between your application and the server, but also because the server has to parse your query, build the response, check access, and handle whatever other overhead SQL servers carry.
So in general it's better to have fewer queries.
You should try to do as much as possible in SQL, though: don't get stuff as input for some algorithm in your client language when the same algorithm could be implemented without hassle in SQL itself. This will not only reduce the number of your queries but also help a great deal in selecting only the rows you need.
Piskvor's answer still applies in any case.
WordPress, for instance, can run up to 30 queries per page. There are several things you can use to keep MySQL from dragging the site down, one of them being memcached, but for now, if the queries are as straightforward as you say, just make sure all the data you pull is properly indexed in MySQL and don't worry much about the number of queries.
If you're using a framework (CodeIgniter, for example) you can generally pull up timing data for page creation and check what's slowing your site down.
As others have said, there is no single number. Whenever possible please use SQL for what it was built for and retrieve sets of data together.
Generally an indication that you may be doing something wrong is when you have SQL inside a loop.
When possible, use joins to retrieve data that belongs together rather than sending several statements.
Always try to make sure your statements retrieve exactly what you need with no extra fields/rows.
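To illustrate the "no SQL inside a loop" point, here is a sketch that collects the ids first and builds one parameterised IN query. The `authors` table, its columns, and the `buildAuthorsQuery` function are hypothetical; the returned SQL and parameters are meant to be passed to something like PDO::prepare() and PDOStatement::execute().

```php
<?php
// Build one parameterised query for many ids instead of one query per id.
function buildAuthorsQuery(array $authorIds): array
{
    $ids = array_values(array_unique(array_map('intval', $authorIds)));
    if ($ids === []) {
        return [null, []]; // nothing to fetch, so no query at all
    }
    $placeholders = implode(', ', array_fill(0, count($ids), '?'));
    $sql = "SELECT id, name FROM authors WHERE id IN ($placeholders)";
    return [$sql, $ids];
}

// Comment rows as they might come back from an earlier query.
$comments = [
    ['id' => 1, 'author_id' => 7],
    ['id' => 2, 'author_id' => 3],
    ['id' => 3, 'author_id' => 7],
];
[$sql, $params] = buildAuthorsQuery(array_column($comments, 'author_id'));
echo $sql, PHP_EOL;
```

Three comments by two distinct authors produce a single query with two placeholders, where a loop would have issued three.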
If you need the queries, you should just use them.
What I always try to do is to have them all executed at once in the same place, so that there is no need for different parts of the page (if they're separated...) to make database connections. I figure it's more efficient to store everything in variables than have every part of a page connect to the database.
In my experience, it is better to make two queries and post-process the results than to make one that takes ten times longer to run that you don't have to post-process. That said, it is also better to not repeat queries if you already have the result, and there are many different ways this can be achieved.
But all of that is oriented around performance optimization. So unless you really know what you're doing (hint: most people in this situation don't), just make the queries you need for the data you need and refactor it later.
I think that you should be limiting yourself to as few queries as possible. Try to combine queries to multitask and save time.
Premature optimisation is a problem like people have mentioned before, but that's where you're crapping up your code to make it run 'fast'. But people take this 'maxim' too far.
If you want to design with scalability in mind, just make sure whatever you do to load data is sufficiently abstracted and calls are centralized, this will make it easier when you need to implement a shared memory cache, as you'll only have to change a few things in a few places.