First of all, the website I run is hosted and I don't have access to be able to install anything interesting like memcached.
I have several web pages displaying HTML tables. The data for these HTML tables are generated using expensive and complex MySQL queries. I've optimized the queries as far as I can, and put indexes in place to improve performance. The problem is if I have high traffic to my site the MySQL server gets hammered, and struggles.
Interestingly - the data within the MySQL tables doesn't change very often. In fact it changes only after a certain 'event' that takes place every few weeks.
So what I have done now is this:
Save the HTML table once generated to a file
When the URL is accessed check the saved file if it exists
If the file is older than 1 hour, run the query and save a new file; if not, output the file
This ensures that for the vast majority of requests the page loads very fast, and the data can at most be 1hr old. For my purpose this isn't too bad.
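For reference, the three steps above boil down to something like the following. This is only a minimal sketch, written in Python rather than PHP to keep it short (PHP has direct equivalents in filemtime(), file_get_contents(), file_put_contents() and rename()). The names cached_html, cache_path and generate are made up for illustration; generate stands for whatever runs the expensive query and renders the table.

```python
import os
import time
from typing import Callable


def cached_html(cache_path: str, max_age_seconds: float,
                generate: Callable[[], str]) -> str:
    """Return the cached HTML if it is fresh enough, else rebuild and save it."""
    try:
        age = time.time() - os.path.getmtime(cache_path)
        if age < max_age_seconds:
            with open(cache_path, encoding="utf-8") as fh:
                return fh.read()
    except FileNotFoundError:
        pass  # no cache file yet: fall through and build it

    html = generate()  # the expensive query and rendering happen here
    tmp_path = f"{cache_path}.{os.getpid()}.tmp"
    with open(tmp_path, "w", encoding="utf-8") as fh:
        fh.write(html)
    # Rename over the old file so a concurrent reader never sees a half-written one.
    os.replace(tmp_path, cache_path)
    return html
```

Writing to a temporary file and renaming it is optional, but it avoids serving a partially written table if two requests arrive while the cache is being rebuilt.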
What I would really like is to guarantee that if any data changes in the database, the cache file is deleted. This could be done by finding all scripts that do any change queries on the table and adding code to remove the cache file, but it's flimsy as all future changes need to also take care of this mechanism.
Is there an elegant way to do this?
I don't have anything but vanilla PHP and MySQL (recent versions) - I'd like to play with memcached, but I can't.
Ok - serious answer.
If you have any sort of database abstraction layer (hopefully you will), you could maintain a field in the database for the last time anything was updated, and manage that from a single point in your abstraction layer.
e.g. (pseudocode): On any update set last_updated.value = Time.now()
Then compare this to the time of the cached file at runtime to see if you need to re-query.
If you don't have an abstraction layer, create a wrapper function to any SQL update call that does this, and always use the wrapper function for any future functionality.
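To make the wrapper idea concrete, here is a rough sketch. It assumes a one-row table called cache_meta and helper names (init_cache_meta, execute_update, cache_is_stale) that are purely illustrative, and it uses Python with SQLite only so that it runs standalone; the same pattern applies to PHP and MySQL.

```python
import sqlite3
import time


def init_cache_meta(conn: sqlite3.Connection) -> None:
    """Create the hypothetical one-row table that remembers the last write."""
    conn.execute("CREATE TABLE IF NOT EXISTS cache_meta "
                 "(id INTEGER PRIMARY KEY, last_updated REAL NOT NULL)")
    conn.execute("INSERT OR IGNORE INTO cache_meta (id, last_updated) VALUES (1, 0)")
    conn.commit()


def execute_update(conn: sqlite3.Connection, sql: str, params: tuple = ()) -> None:
    """Single entry point for every write: run it, then stamp last_updated."""
    with conn:  # one transaction, so the write and the timestamp commit together
        conn.execute(sql, params)
        conn.execute("UPDATE cache_meta SET last_updated = ? WHERE id = 1",
                     (time.time(),))


def cache_is_stale(conn: sqlite3.Connection, cache_mtime: float) -> bool:
    """True when the data changed after the cache file was written."""
    (last_updated,) = conn.execute(
        "SELECT last_updated FROM cache_meta WHERE id = 1").fetchone()
    return last_updated > cache_mtime
```

At request time you would pass the cache file's modification time to cache_is_stale and only re-run the expensive query when it returns True. The check itself is a single-row lookup, so it is far cheaper than the queries it guards.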
"There are only two hard things in Computer Science: cache invalidation and naming things." (Phil Karlton)
Sorry, doesn't help much, but it is sooooo true.
You have most of the ends covered, but a last_modified field and cron job might help.
There's no way of deleting files from within MySQL. Postgres would give you that facility, but MySQL doesn't.
You can cache your output to a string using PHP's output buffering functions. Google it and you'll find a nice collection of websites explaining how this is done.
I'm wondering, however: how do you know that the data expires after an hour? Or are you assuming the data won't change dramatically enough in 60 minutes to warrant constant page generation?
Related
I've got a heavy-read website associated to a MySQL database. I also have some little "auxiliary" information (fits in an array of 30-40 elements as of now), hierarchically organized and yet gets periodically and slowly updated 4-5 times per year. It's not a configuration file though since this information is about the subject of the website and not about its functioning, but still kind of a configuration file. Until now, I just used a static PHP file containing an array of info, but now I need a way to update it via a backend CMS from my admin panel.
I thought of a simple CMS that allows the admin to create/edit/delete entries, periodical rare job, and then creates a static JSON file to be used by the page building scripts instead of pulling this information from the db.
The question is: given the heavy-read nature of the website, is it better to read a rarely updated JSON file on the server when building pages or just retrieve raw info from the database for every request?
I just used a static PHP
This sounds like a contradiction to me. Either static, or PHP.
given the heavy-read nature of the website, is it better to read a rarely updated JSON file on the server when building pages or just retrieve raw info from the database for every request?
Cache was invented for a reason :) Same with your case - it all depends on how often the data changes versus how often it is read. If the data changes once a day and stays static for 100k downloads during that day, then not caching it or not serving it from a flat file would simply be stupid. If the data changes once a day and you average 20 reads per day, then perhaps returning the data from code on each request would be less stupid, but on the other hand, the other 19 requests could be served from cache anyway, so... If you can, serve from a flat file.
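A flat-file setup along those lines can be very small. The sketch below is in Python for brevity (PHP offers json_encode(), json_decode() and rename() for the same job); publish_info and load_info are hypothetical names, with the first meant to be called by the admin panel after an edit and the second by the page-building scripts.

```python
import json
import os


def publish_info(info: dict, json_path: str) -> None:
    """Write the rarely changing data to a JSON file, replacing it atomically."""
    tmp_path = json_path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as fh:
        json.dump(info, fh)
    os.replace(tmp_path, json_path)  # readers see either the old or the new file


def load_info(json_path: str) -> dict:
    """Read the data when building a page; no database round trip involved."""
    with open(json_path, encoding="utf-8") as fh:
        return json.load(fh)
```

The database stays the source of truth for the CMS, and the JSON file is just a derived copy that gets rewritten on the 4-5 occasions per year when the data changes.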
Caching is your best option; Redis and Memcached are common, excellent choices. As for flat file versus database, it's hard to say without knowing the SQL schema you're using (as in, how many columns, what the datatype definitions are, how many foreign keys and indexes, etc.).
SQL is about relational data; if you have non-relational data, you don't really have a reason to use SQL. Many people are switching to NoSQL databases to handle this, since modifying SQL databases after the fact is a huge pain.
I have about 10 tables with ~10,000 rows each which need to be pulled very often.
For example, list of countries, list of all schools in the world, etc.
PHP can't persist this stuff in memory (to my knowledge), so I would have to query the server with a SELECT * FROM TABLE every time. Should I use memcached here? At first thought it's a clear, absolute yes, but on second thought, wouldn't MySQL already be caching this for me, making memcached almost redundant?
I don't have too much understanding of how mysql caches data (or if it even does cache entire tables).
You could use MySQL query cache, but then you are still using DB resources to establish the connection and execute the query. Another option is opcode caching if your pages are relatively static. However I think memcached is the most flexible solution. For example if you have a list of countries which need to be accessed from various code-points within your application, you could pull the data from the persistent store (mysql), and store them into memcached. Then the data is available to any part of your application (including batch processes and cronjobs) for any business requirement.
I'd suggest reading up on the MySQL query cache:
http://dev.mysql.com/doc/refman/5.6/en/query-cache.html
You do need some kind of a cache here, certainly; layers of caching within and surrounding the database are considerably less efficient than what memcached can provide.
That said, if you're jumping to the conclusion that the Right Thing is to cache the query itself, rather than to cache the content you're generating based on the query, I think you're jumping to conclusions -- more analysis is needed.
What data, other than the content of these queries, is used during output generation? Would a page cache or page fragment cache (or caching reverse-proxy in front) make more sense? Is it really necessary to run these queries "often"? How frequently does the underlying data change? Do you have any kind of a notification event when that happens?
Also, SELECT * queries without a WHERE clause are a "code smell" (indicating that something probably is being done the Wrong Way), especially if not all of the data pulled is directly displayed to the user.
I am designing a web application which will be doing three things:
1) store some data
2) let users view this data
3) from time to time add/remove/change some data
Looks pretty simple, but I would like to minimise usage of server resources by avoiding MySQL and PHP. My main goal is to deliver an HTML file to the user: posts1.html (posts2.html, posts3.html, ..., where 1, 2, 3 are the page numbers of the data).
Normally, I would create a posts.php file, which would send a query to the database, but my data changes only three to five times a day, so it would be a huge waste.
Instead, I thought about caching this data, which would spare a lot of server resources, but in this situation there would still be some PHP code involved.
Another idea is to create a script that generates all the HTML files after every change in the database and then replaces the old ones with them. But what if someone requests a page that is being replaced right now? It may cause errors, the user could get an incomplete file, etc.
However, there is one solution - I could store created HTML files in two directories (A and B) and using .htaccess do something like this (pseudocode):
if ( (HOURS)%2 == 0 )
/postsX.html -> /A/postsX.html
else
/postsX.html -> /B/postsX.html
It would give me enough time to upgrade all files.
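Spelled out as runnable code, the parity rule from the pseudocode looks like this. It is a Python sketch of the selection logic only, not an .htaccess rule, and the function names are invented for illustration.

```python
from datetime import datetime


def active_directory(now: datetime) -> str:
    """Directory that requests are served from during the current hour."""
    return "A" if now.hour % 2 == 0 else "B"


def rebuild_directory(now: datetime) -> str:
    """Directory that is idle right now and therefore safe to regenerate."""
    return "B" if now.hour % 2 == 0 else "A"
```

Two consequences of this scheme: a regeneration has to finish before the hour flips, and freshly generated files only become visible at the next hour boundary.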
I would love to hear what you think about it and what you would do.
If you don't want to use a full-blown MySQL server, use SQLite. It's part of PHP and very lightweight. Then add caching where appropriate. Your other approaches sound like a waste of time to me. Too much effort for too little gain. SQLite and caching is tried and tested.
Besides, you should not worry about wasting resources unless you are running short on them. Your application doesn't sound like it needs scaling at this point. So build the simplest thing that will work.
If you have to have that static pages approach, then put all those files into a symlinked folder. Create a script that generates the static pages into a new folder (either via cron or manual trigger) and then changes the symlink from the old folder to the new folder. This way you don't have to worry about people hitting your site while it's generating content.
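The symlink switch can be done so that there is never a moment without a valid link. A rough sketch in Python, assuming a POSIX filesystem; switch_symlink and the ".new" suffix are illustrative choices, not part of any existing tool.

```python
import os


def switch_symlink(link_path: str, new_target: str) -> None:
    """Point link_path at new_target without a gap where the link is missing."""
    tmp_link = link_path + ".new"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)  # clean up a leftover from an interrupted run
    os.symlink(new_target, tmp_link)
    # Renaming over the old link replaces it in a single step on POSIX systems.
    os.replace(tmp_link, link_path)
```

The generator script would build the pages into a fresh folder first and call switch_symlink only once that folder is complete, so visitors see either the full old set or the full new set.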
You should use SQLite with ADODB, or any other supported database, and implement caching. See the ADODB compatibility list (http://phplens.com/lens/adodb/docs-adodb.htm#drivers). The caching feature is really powerful, and ADODB is well known and well documented.
I have a website that lets each user create a webpage (to advertise his product). Once the page is created it will never be modified again.
Now, my question: Is it better to keep the page content (only a few parts are editable) in a MySQL database and generate it using queries every time the page is accessed, or to create a static webpage containing all the info and store it on the server?
If I store every page on the disk, I may reach like 200.000 files.
If I store each page in MySQL database I would have to make a query each time the page is requested, and for like 200.000 entries and 5-6 queries/second I think the website will be slow...
So what's better?
MySQL will be able to handle the load if you create the tables properly (normalized and indexed). But if the content of the page doesn't change after creation, it's better if you cache the page statically. You can organize the files into buckets (folders) so that one folder doesn't have too many files in it.
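One common way to do the bucketing is to derive the folder names from a hash of the page ID. The sketch below is in Python and uses an invented helper name (bucketed_path) and an assumed two-level layout; with 256 x 256 folders, 200.000 pages work out to roughly three files per folder on average.

```python
import hashlib
import os


def bucketed_path(root: str, page_id: str) -> str:
    """Map a page ID to root/xx/yy/<page_id>.html using a hash of the ID."""
    digest = hashlib.sha256(page_id.encode("utf-8")).hexdigest()
    # The first two hex pairs pick the two folder levels.
    return os.path.join(root, digest[:2], digest[2:4], f"{page_id}.html")
```

Because the path is computed from the ID alone, no lookup table is needed to find a page again later.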
Remember to cache only the content areas and not the templates. Unless each user has complete control over how his/her page shows up.
200.000 files writable by the Apache process is not a good idea.
I recommend using a database.
Database imports/exports are easier, not to mention the difference in maintenance costs.
I originally claimed that databases use caching and, if nothing has changed, will pull up the last result without running the query again. That doesn't hold; thanks, JohnP.
If you want to redesign your webpage sometime later, you need to store the pages in MySQL, as you can't really change them (unless you dig into regexp) once they are static.
About the time issue: it's not an issue if you set up your indexes right.
If the data is small to moderate, prefer static hardcoding, i.e. putting the data in the HTML; but if it is huge, computed, or dynamic and changing, you have no option but to connect to the database.
I believe that a proper caching technique with certain attributes (a long expiry time) would be better than static pages or retrieving everything from MySQL every time.
Static content is usually a good thing if you have a lot of traffic, but 5-6 queries a second is not hard for the database at all, so with your current load it doesn't matter.
You can spread the static files to different directories by file name and set up rewrite rules in your web server (mod_rewrite on Apache, basic location matching with regexp on Nginx and similar on other web servers). That way you won't even have to invoke the PHP interpreter.
A database and proper caching. 200.000 pages times, what? 5KB? That's 1 GB. Easy to keep in RAM. Besides 5/6 queries per second is easy on a database. Program first, then benchmark.
// insert quip about premature optimisation
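The back-of-the-envelope figure above checks out, assuming the guessed 5 KB average page size and decimal units:

```python
pages = 200_000
avg_page_bytes = 5 * 1000  # 5 KB per page, the rough guess from the answer

total_bytes = pages * avg_page_bytes
total_gb = total_bytes / 1000**3  # decimal gigabytes

print(f"{total_bytes:,} bytes = {total_gb} GB")
```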
I don't really have any experience with caching at all, so this may seem like a stupid question, but how do you know when to cache your data? I wasn't even able to find one site that talked about this, but it may just be my searching skills or maybe too many variables to consider?
I will most likely be using APC. Does anyone have any examples of the smallest amount of data that is worth caching? For example, let's say you have an array with 100 items and you use a foreach loop on it and perform some simple array manipulation: should you cache the result? How about if it had 1,000 items, 10,000 items, etc.?
Should you be caching the results of your database queries? What kind of queries should you be caching? I assume a simple select, maybe with a couple of joins, against a MySQL db doesn't need caching, or does it? Assuming the MySQL query cache is turned on, does that mean you don't need to cache in the application layer, or should you still do it?
If you instantiate an object, should you cache it? How to determine whether it should be cached or not? So a general guide on what to cache would be nice, examples would also be really helpful, thanks.
When you're looking at caching data that has been read from the database in APC/memcache/WinCache/redis/etc, you should be aware that it will not be updated when the database is updated unless you explicitly code to keep the database and cache in synch. Therefore, caching is most effective when the data from the database doesn't change often, but also requires a more complex and/or expensive query to retrieve that data from the database (otherwise, you may as well read it from the database when you need it)... so expensive join queries that return the same data records whenever they're run are prime candidates.
And always test to see if queries are faster read from the database than from cache. Correct database indexing can vastly improve database access times, especially as most databases maintain their own internal cache as well, so don't use APC or equivalent to cache data unless the database overheads justify it.
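A crude way to run that comparison is to time both code paths over many repetitions. The helper below is a Python sketch with a hypothetical name (average_seconds); you would pass it one function that reads from the database and one that reads from the cache, then compare the two averages.

```python
import time
from typing import Callable


def average_seconds(fn: Callable[[], object], repeats: int = 100) -> float:
    """Average wall-clock time of one call to fn, measured over several runs."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats
```

Run it against realistic data and under realistic load if you can; a single warm run on an idle development machine can make either side look better than it is.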
You also need to be aware of space usage in the cache. Most caches are a fixed size and you don't want to overfill them... so don't use them to store large volumes of data. Use the apc.php script available with APC to monitor cache usage (though make sure that it's not publicly accessible to anybody and everybody that accesses your site.... bad security).
When holding objects in cache, the object will be serialized (serialize()) when it's stored and unserialized (unserialize()) when it's retrieved, so there is an overhead. Objects with resource attributes will lose that resource, so don't store your database access objects.
It's sensible only to use cache to store information that is accessed by many/all users, rather than user-specific data. For user session information, stick with normal PHP sessions.
The simple answer is that you cache data when things get slow. Obviously, for any medium to large sized application, you need to do much more planning than just a wait-and-see approach. But for the vast majority of websites out there, the question to ask yourself is "Are you happy with the load time?". Of course, if you are obsessive about load time, like myself, you are going to want to try to make it even faster regardless.
Next, you have to identify what specifically is causing the slowness. You assumed that your application code was the source, but it's worth examining whether there are other external factors such as large page size, excessive requests, no gzip, etc. Use a site like http://tools.pingdom.com/ or an extension like YSlow as a start for that. (Quick tip: make sure keepalives and gzip are working.)
Assuming the problem is the duration of execution of your application code, you are going to want to profile your code with something like xdebug (http://www.xdebug.org/) and view the output with kcachegrind or wincachegrind. That will let you know what parts of your code are taking long to run. From there you will make decisions on what to cache and how to cache it (or make improvements in the logic of your code).
There are so many possibilities for what the problem could be, and for the associated solutions, that it is not worth me guessing. So, once you identify the problem, you may want to post a new question about solving that specific problem. I will say that, if not used properly, the MySQL query cache can be counterproductive. Also, I generally avoid the APC user cache in favor of memcached.