Creating a two-pass PHP cache system with mutable items

I want to implement a two-pass cache system:
The first pass generates a PHP file with all of the common stuff (e.g. news items) hardcoded. The database has a cache table linking these files to pages (e.g. "index.php page=1 style=default"); it also stores an up-to-date flag which, if false, causes the first pass to rerun the next time the page is viewed.
The second pass fills in the minor details, such as how long ago something happened, and mutable items like "You are logged in as...".
However, I'm not sure of an efficient implementation that supports both cached and non-cached (e.g. search) pages without a lot of code and several queries.
Right now, each time the page is loaded the PHP script runs and regenerates the whole page. For pages like search this is fine, because most searches are different, but other pages such as the index are virtually the same for each hit, yet generate a large number of queries and run quite a long script.
The problem is that some parts of the page do change on a per-user basis, such as the "You are logged in as..." section, so simply saving the generated pages would still result in tens of thousands of nearly identical pages.
The main concern is reducing the load on the server, since I'm on shared hosting and at this point can't afford to upgrade, but the site is using a sizeable portion of the server's CPU and putting a fair load on the MySQL server.
So basically, minimising how much has to be done for each page request, and not regenerating things like the news items on the index every time, seems a good start, compared to, say, search, which is a far less static page.
I actually considered hardcoding the news items as plain HTML, but that would mean maintaining them in several places (since they may also appear in search results, and the comments live on a page dedicated to each news item, i.e. news.php).
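To make the idea concrete, a minimal sketch of the two-pass approach (the cache file name and the render_news_items()/logged_in_as() helpers are hypothetical):

<?php
// First pass (run only when the cache table marks the page out of date):
// render the expensive shared parts once, and write out a PHP file that
// still contains small dynamic islands for the mutable bits.
$newsHtml = render_news_items(); // hypothetical: the heavy queries happen here
$page = '<div id="news">' . $newsHtml . '</div>' . "\n"
      . '<div id="user"><?php echo logged_in_as(); ?></div>';
file_put_contents('cache/index_page1_default.php', $page);

// Second pass (every request): include the cached file. Only the tiny
// dynamic islands (logged_in_as() etc.) execute; the news HTML is static.
include 'cache/index_page1_default.php';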

I second Ken's recommendation of PEAR's Cache_Lite library; you can use it to easily cache either parts of pages or entire pages.
If you're running your own server(s), I'd strongly recommend memcached instead. It's much faster since it runs entirely in memory, and it's used extensively by a lot of high-volume sites. It's a very easy, stable, trouble-free daemon to run. In terms of your PHP code, you'd use it much the same way as Cache_Lite, to cache various page sections or full pages (or other arbitrary blobs of data), and it's very easy to use since PHP has a memcache extension available.
For super-high-traffic full-page caching, take a look at running Varnish or Squid as a caching reverse proxy server. (Pages served by Varnish will easily come out 100x faster than anything that hits the PHP interpreter.)
Keep in mind that with caching, you really only need to cache things that are being frequently accessed. It can be a trap to develop a really sophisticated caching strategy when you don't need it. For a page like your home page that's getting hit several times a second, you definitely want to optimize it for speed; for a page that gets maybe a few hits an hour, like a month-old blog post, caching is a bad idea: you only waste your time and make things more complicated and bug-prone.

I recommend not reinventing the wheel... there are template engines that support caching, like Smarty.

For server-side caching, use something like Cache_Lite (and let someone else worry about file locking, expiry dates, and file corruption).
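For example, a minimal Cache_Lite sketch might look like this (the cache directory, lifetime, cache ID, and the build_news_html() helper are placeholders):

<?php
require_once 'Cache/Lite.php';

$cache = new Cache_Lite(array(
    'cacheDir' => '/tmp/cache/',
    'lifeTime' => 600,              // entries expire after 10 minutes
));

if (($html = $cache->get('front_page_news')) === false) {
    $html = build_news_html();      // hypothetical expensive page-building code
    $cache->save($html, 'front_page_news');
}
echo $html;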

You want to save the results to a file and use logic like this to pull them back out (sketched in PHP; generate_results() stands in for your page-building code):
<?php
$cacheFile = 'cache/index.html'; // one file per cacheable page

if (file_exists($cacheFile)) {
    include $cacheFile;                   // cache hit: serve the saved copy
} else {
    $html = generate_results();           // cache miss: run the expensive code
    file_put_contents($cacheFile, $html); // write to file for next time
    echo $html;                           // output without re-reading the file
}
To be clear, you don't need two passes because you can save parts of the page and leave the rest dynamic.

As always with this type of question, my response is:
Why do you need the caching?
Is your application consuming too much IO on your database?
What metrics have you run?
You are talking about adding an extra level of complexity to your app, so you need to be very sure that you actually need it.
You might actually benefit from the built-in MySQL query cache, if the database is the contention point in your system. The other option is to use Memcache.

I would recommend using an existing caching mechanism. Depending on what you really need, you might be looking for APC, memcached, or one of the various template caching libs... It's easier and faster to tune written, tested code to your needs than to write everything from scratch (usually; there may be situations where you don't have a choice).

Related

include selectively or globally?

With this question, I aim to understand the inner workings of PHP a little better.
Assume you have a 50K library, loaded with a bunch of handy functions that you use here and there. Also assume that these functions are needed by, say, 10% of the pages of your site, but that your home page definitely needs them.
Now, the question is... should you use a global include that points to this library - across the board - so that ALL the pages (including the 90% that do not need the library) get it, or should you selectively add that include reference only on the pages that need it?
Before answering, let me point out why I ask...
When you include that reference, PHP may cache it, so the performance hit I worry about may be a one-time deal rather than a per-request cost. Once that first hit is out of the way, subsequent loads may not be as bad as one might think. That's all because of the smart caching mechanisms that PHP deploys - which I do not have deep knowledge of, hence the question...
Since the front page needs that library anyway, the argument could be: why not keep that library warm and fresh in memory and serve it across the board?
When answering, please approach the matter strictly from a caching/performance point of view, not from a convenience point of view, so that the discussion doesn't shift to programming style and its do's and don'ts.
Thank you
Measure it, then you know.
The caching benefit is likely marginal after the first hit, since the OS will cache the file as well, but that saves only the I/O hit (granted, this is not nothing). However, you will still incur the processing hit. If you include your 50K of code into a "Hello World" page, you will still pay the CPU and memory penalty to load and parse that 50K of source, even if you do not execute any of it. That part of the processing will most likely not be cached in any way.
In general, CPU is extremely cheap today, so it may not be "worth saving". But that's why you need to actually measure it, so you can decide for yourself.
The caching I think you are referring to would be opcode caching, from something like APC? All that does is prevent PHP from needing to interpret the source each time. You still take some hit for each include or require you use. One paradigm is to scrap the procedural functions and use classes loaded via __autoload(); see the sketch below. That makes for a simple use-on-demand strategy with large apps. I also agree with Will that you should measure this if you are concerned. Premature optimization never helps.
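A minimal sketch of that load-on-demand idea, using spl_autoload_register (the modern replacement for __autoload; the lib/ directory and TextHelper class are hypothetical):

<?php
spl_autoload_register(function ($class) {
    $file = __DIR__ . '/lib/' . $class . '.php';
    if (is_file($file)) {
        require $file; // the class file is loaded and parsed only when first needed
    }
});

$helper = new TextHelper(); // triggers loading of lib/TextHelper.php only here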
I very much appreciate your concerns about performance.
The short answer is that, for best performance, I'd probably conditionally include the file on only the pages that need it.
PHP's opcode caches will maintain both include files in a cached form, so you don't have to worry about keeping the cache "warm" as you might when using other types of caches. The cache will remain until there are memory limitations (not an issue with your 50K script), the source file is updated, you manually clear the cache, or the server is restarted.
That said, opcode (PHP bytecode) caching is only one part of the PHP parsing process. Every time a script is run, the bytecode is then processed to build up the functions, classes, objects, and other instance variables that are defined and optionally used within the script. This all adds up.
In this case, a simple change can lead to significant improvement in performance. Be green, every cycle counts :)

How can I display currently online users, without a database? Or at least very efficiently. Preferably in PHP

This issue has been quite the brain teaser for me for a little while. Apologies if I write quite a lot; I just want to be clear about what I've already tried, etc.
I will explain the idea of my problem as simply as possible, as the complexities are pretty irrelevant.
We may have up to 80-90 users on the site at any one time. They will likely all be accessing the same page, which I will call result.php. They will, however, be accessing different results via a GET variable for the ID (result.php?ID=456). It is likely that fewer than 3 or 4 users will be on an individual record at any one time, and there are upwards of 10,000 records.
I need to know, with less than a 20-25 second margin of error (this is very important), who is on that particular ID on that page, and update the page accordingly, removing their name as soon as possible once they are no longer on the page.
At the moment, I am using a jQuery script which calls a PHP file that reads from a database table of "currently accessing" usernames for this particular ID, counting only those whose last access was within the last 25 seconds. The file also removes all entries older than 5 minutes, to keep the table tidy.
This was alright with 20 or 30 users, but now that the load has more than doubled, I am noticing this is a particularly slow method.
What other methods are available to me? Has anyone had any experience in a similar situation?
Everything we use at the moment is coded in PHP with a little jQuery. We are running on a server managed offsite by a hosting company, if that matters.
I have come across something called Comet or a Comet Server which sounds like it could potentially be of assistance, but it also sounds extremely complicated for my purposes and far beyond my understanding at the moment.
Look into WebSockets for a real-time socket connection. You could use WebSockets to push out updates in real time (instead of polling) to ensure changes in the 'currently online users' list are sent within milliseconds.
What you want is an in-memory cache with a service layer that maintains the state of activity on the site. Using memcached might be a good starting point. Your pseudo-code would be something like:
On page access, make a call to CurrentUserService
CurrentUserService takes as a parameter the page you're accessing and who you are.
Each time you call it, it removes whatever you were accessing before from the cache.
Then it adds what you're currently accessing.
Then it compiles a list of who else is accessing the same thing based on the current state in the cache.
It returns this list, which your page processes and displays.
If you record when someone accesses a page, you can set a timeout for when the service stops 'counting' them as accessing the page.
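A rough PHP sketch of that service using the Memcached extension (the key scheme, the 25-second window, and tracking only the current page are assumptions; $currentUsername is whatever your session provides):

<?php
$mc = new Memcached();
$mc->addServer('127.0.0.1', 11211);

function whoIsViewing(Memcached $mc, $pageId, $username)
{
    $key = 'viewers:' . $pageId;
    $viewers = $mc->get($key) ?: array(); // username => last-seen timestamp
    $viewers[$username] = time();         // record this user's latest hit

    foreach ($viewers as $name => $seen) {
        if (time() - $seen > 25) {
            unset($viewers[$name]);       // stop counting anyone idle > 25 seconds
        }
    }

    $mc->set($key, $viewers, 300);        // the whole entry expires after 5 minutes
    return array_keys($viewers);
}

// Called from the jQuery polling endpoint:
$online = whoIsViewing($mc, (int) $_GET['ID'], $currentUsername);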

How important is caching for a site's speed with PHP?

I've just made a user-content orientated website.
It is done in PHP, MySQL and jQuery's AJAX. At the moment there are only a dozen or so submissions, and already I can feel it lagging slightly when it goes to a new page (and therefore runs a new MySQL query).
Is it most important for me to try and optimise my MySQL queries (by prepared statements), or is it worth looking at CDNs (Amazon S3) and caching static HTML files (much like the WordPress plugin WP Super Cache does) when no new content has been submitted?
Which route is the most beneficial for me, as a developer, to take, i.e. where am I better off concentrating my efforts to speed up the site?
Premature optimization is the root of all evil
-Donald Knuth
Optimize when you see issues, don't jump to conclusions and waste time optimizing what you think might be the issue.
Besides, I think you have more important things to work out on the site (like being able to cast multiple votes on the same question) before worrying about a caching layer.
It is done in PHP, MySQL and jQuery's AJAX. At the moment there are only a dozen or so submissions, and already I can feel it lagging slightly when it goes to a new page (and therefore runs a new MySQL query)
"Can feel it lagging slightly" – Don't feel it, know it. Run benchmarks and time your queries. Are you running queries effectively? Is the database setup with the right indexes and keys?
That being said...
CDNs
A CDN works great for serving static content: CSS, JavaScript, images, etc. It can speed up page loads by minimizing the time it takes to request all those resources. It will not fix bad query practice.
Content Caching
The easiest way to implement content caching is with something like Varnish. It basically sits in front of your site and re-serves content that hasn't been updated. It's minimally intrusive and easy to set up, while being amazingly effective.
Database
Is it most important for me to try and optimise my MySQL queries (by prepared statements)
Why the hell aren't you already using prepared statements? If you're doing raw SQL queries, always use prepared statements unless you absolutely trust the content going into the queries. Given a user-content-based site, I don't think you can safely say that. If you notice query times running high, then take a look at the database schema, the queries you run per page, and the amount of content you have. With a few dozen entries you should not be noticing any issues, even with the worst queries.
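For reference, a parameterised query with PDO looks something like this (the DSN, credentials, and table are placeholders):

<?php
$pdo = new PDO('mysql:host=localhost;dbname=mysite', 'user', 'pass');

$stmt = $pdo->prepare('SELECT title, body FROM submissions WHERE id = ?');
$stmt->execute(array($_GET['id'])); // user input never touches the SQL string
$row = $stmt->fetch(PDO::FETCH_ASSOC);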
I checked out your site and it seems a bit sluggish to me as well, although it's not 100% clear it's the database.
A good first step here is to start on the outside and work your way in. Use something like Firebug (for Firefox), which - like similar plug-ins of its type - will let you break down where the time goes when loading a page.
http://getfirebug.com/
Second, per your comment above, do start using prepared statements where applicable; it can make a big difference.
Third, make sure your DB work is minimally complete - that means making sure you have indexes in the right places. It can be useful here to run the types of queries your site actually issues and see where the time goes. EXPLAIN plans
http://dev.mysql.com/doc/refman/5.0/en/explain.html
and MySQL driver logging (if your driver supports it) can be helpful here.
If the site is still slow and you've narrowed it down to use of the database, my suggestion is to start with a simple optimization. Caching DB data, if feasible, is likely to give you a pretty big bang for the buck. One very simple solution toward that end, especially given the stack you mention above, is to use Memcached:
http://memcached.org/
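The usual pattern here is cache-aside: check Memcached first, fall back to MySQL on a miss, and store the result for next time. A sketch using the Memcache extension (the key name, TTL, and query are illustrative, and $pdo is assumed to be an existing PDO connection):

<?php
$memcache = new Memcache();
$memcache->connect('127.0.0.1', 11211);

$news = $memcache->get('front_page_news');
if ($news === false) { // cache miss: go to the database
    $news = $pdo->query('SELECT * FROM news ORDER BY posted DESC LIMIT 10')
                ->fetchAll(PDO::FETCH_ASSOC);
    $memcache->set('front_page_news', $news, 0, 300); // cache for 5 minutes
}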
After injecting that into your stack, measure your performance + scalability and only pursue more advanced technologies if you really need to. I think you'll find that simple load balancing, caching, and a few instances of your service will go pretty far in addressing basic performance + scalability goals.
In parallel, I suggest coming up with a methodology to measure this more regularly and accurately. For example, decide how you will actually do automated latency measures and load testing, etc.
For me, optimising the DB comes first, because with caching, whenever you find and fix some problem you need to rebuild the whole cache.
There are several areas that can be optimized.
Server
CSS/JS/Images
PHP Code/Setup
MySQL Code/Setup
1st, I would use Firefox and the YSlow add-on to evaluate your website's performance; it will give server-based suggestions.
Another solution I have used is this add-on:
http://aciddrop.com/php-speedy/
"PHP Speedy is a script that you can install on your web server to automatically speed up the download time of your web pages."
2nd, I would create a static domain name like static.yourdomainname.com, pointing at a different folder, and move all your images, CSS and JS there. Then point all your code at that domain, and tweak your web server settings to cache all those files.
3rd, I would look at articles/techniques like this one, http://www.catswhocode.com/blog/3-ways-to-compress-css-files-using-php , to help compress/optimize your static files like CSS/JS.
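One simple variant of that technique is to serve the stylesheet through a small PHP wrapper (the file names and the 30-day lifetime are arbitrary):

<?php
// css.php - compresses and caches a stylesheet on the fly
ob_start('ob_gzhandler');                 // gzip the output if the client accepts it
header('Content-Type: text/css');
header('Cache-Control: max-age=2592000'); // let browsers cache it for 30 days
readfile('styles/main.css');              // placeholder path to the raw CSS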
4th, review all your images and their sizes, and make sure they are fully optimized, or convert to using CSS sprites.
http://www.smashingmagazine.com/2009/04/27/the-mystery-of-css-sprites-techniques-tools-and-tutorials/
http://css-tricks.com/css-sprites/
Basically, for all your main site images: move them into one CSS sprite, then change your CSS to refer to different spots on that sprite to display the image needed.
5th, review your content pages: which pages change frequently, and which ones rarely change? Make those that rarely change into static HTML pages. Those that change frequently you can either leave as PHP pages, or create a cron or scheduled task that uses the PHP command line to generate new static HTML versions of the PHP page (see the sketch below).
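A sketch of that scheduled regeneration (file names are illustrative; the script would be run from cron with the PHP CLI, e.g. every 10 minutes):

<?php
// regenerate.php - capture the output of the dynamic page and save it as static HTML
ob_start();
include 'index.php';                        // the normal dynamic page
file_put_contents('static/index.html', ob_get_clean());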
6th, for MySQL, I recommend you turn on the slow query log to help identify slow queries. Review your table structure and make sure your tables are well designed. Use views and stored procedures to move heavy SQL logic from PHP into MySQL.
I know this is a lot, but I hope it's useful.
It depends where your slowdowns really lie. You have a lot of Twitter and Facebook stuff on there that could easily slow your page down significantly.
Use Firebug to see if anything is being downloaded during your perceived slow loading times. You can also download the YSlow Firefox plugin for tips on speeding up page loads.
A significant portion of perceived slowness can be due to the javascript on the page rather than your back-end. With such a small site you should not see any performance issues on the back end until you have thousands of submissions.
Is it most important for me to try and optimise my MySQL queries (by prepared statements)
Sure.
But prepared statements have nothing to do with optimization.
Nearly 99% of sites run with no cache at all, so I don't think you really need it.
If your site is running slow, you have to profile it first, and then optimise the specific places proven to be bottlenecks.

Are there any performance issues with storing content in a database, instead of on a normal ASPX or PHP page?

A colleague and I were discussing the best way to build a website last week. We both have different ideas about how to store content on the website. The way I have always approached this has been to store any sort of text or image link (not the image file itself) in a database. This way, if I needed to change a letter or a sentence, I would just need to change it in the database; I wouldn't have to touch the actual web page itself.
My colleague agreed with this up to a point. He thinks that there are performance issues related to retrieving content from the database, especially if every character of content comes from the database. When he builds a website, any content that won't be changed often (if at all) is hardcoded on the page, and any content that will be changed or added regularly comes from the database.
I can't see the benefit of doing it his way, if only because every time we make a change to an ASPX page we need to recompile the site to upload it. So if a page has a misspelt "The" (so it'd be like "Teh"), we have to change it on the page and then recompile the whole site and upload it.
Likewise, my colleague thinks that if everything came from the database there would be performance issues with the site and the database, and that the overall loading speed of the web page in the browser would decrease.
What we were both left wondering is: if a website drew everything from the database (not HTML code as such, more like content for the headers, footers, links, etc.), would it slow down the website? And if there is a performance issue, which would be better: a 100% database-driven website with its performance issues, or a website with hardcoded content, which means 10-20 minutes spent compiling and uploading the site just for the sake of a one-word or one-letter change?
I'm interested to see if anyone else has come across this, or has their own thoughts on the subject.
Cheers
Naturally it's a bit slower to retrieve information from a database rather than directly from the file system. But do you really care? If you design your application correctly then
a) you can implement caching so that the database is not hit for every page
b) the performance difference will be tiny anyway, particularly compared to the time to transmit the page from the server to the client
A 100% database approach opens up the potential for more flexibility and features in your application.
This is a classic case of putting caching / performance considerations before features / usability. Bottlenecks rarely occur where or when you expect them to - so focus on developing a powerful application and then implement caching later - when it's needed and where it's needed.
I'm not suggesting storing templates as static files is a bad idea - just that performance shouldn't be your primary driver in making these assessments. Static templates may be more secure or easier to edit using your development tools for example.
Hardcode the strings in the code (unless you plan to support multiple languages).
It is not worth:
the extra code required for maintaining the strings,
the added complexity,
and the possible performance penalty.
Would you extract the string "Cancel" from a button?
If so, would you be using the same string on multiple cancel buttons? Or one for each?
If you decided to rename one button to "Cancel registration", how do you identify which "Cancel" to update in the database? You would be forced to set up a working process for dealing with this, and in my opinion it's just not worth it.

First site going live real soon. Last minute questions

I am really close to finishing a project that I've been working on. I have done websites before, but never on my own, and never a site that involved user-generated data.
I have been reading up on things that should be considered before you go live and I have some questions.
1) Staging... (deploying updates without affecting users). I'm not really sure what this would entail, since I'm sure that any type of update would affect users in some way. Does this mean some type of temporary downtime for every update? Can somebody please explain this, and a solution to it as well?
2) Limits... I'm using the Kohana framework, and its Auth module for logging users in. I was wondering if this already has some type of limit on login attempts built in, and if not, what would be the best way to implement one (save attempts in the database, a cookie, etc.)? If this is not what's meant by limits, can somebody elaborate?
Edit: I think a good way to do this would be to freeze logging in for a period of time (say 15 minutes), or to display a captcha, after a handful (10 or so) of unsuccessful login attempts.
3) Caching... Like I said, this is my first site built around user content. Considering that, should I cache it?
4) Backups... How often should I back up my (MySQL) database, and how should I back it up (MySQL export?).
The site is currently up, yet not finished, if anybody wants to look at it and see if anything pops out at you that should be looked at or fixed: Clashing Thoughts.
If there is anything else I overlooked that's not already in the list linked to above, please let me know.
Edit: If anybody has any advice on getting the word out (marketing), I'd appreciate that too.
Thanks.
EDIT: I've made the changes, and the site is now live.
1) Most sites that incorporate frequent updates, or that have a massive update that will take some time, use a beta domain such as beta.example.com that is restricted to staff until the changes are released to the main site for the public.
2) If you use cookies, then users can just disable cookies and have infinite login attempts, so your efforts will go to waste. So yeah, use the database instead; how you keep track is up to you (a sketch follows).
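A database-backed sketch, assuming a login_attempts table with username and attempted_at columns (the 10-attempt threshold and 15-minute window come from the question's edit; the table and names are otherwise placeholders):

<?php
function tooManyAttempts(PDO $pdo, $username)
{
    $stmt = $pdo->prepare(
        'SELECT COUNT(*) FROM login_attempts
         WHERE username = ? AND attempted_at > NOW() - INTERVAL 15 MINUTE'
    );
    $stmt->execute(array($username));
    return $stmt->fetchColumn() >= 10; // freeze login (or show a captcha) past 10 failures
}

function recordFailure(PDO $pdo, $username)
{
    $pdo->prepare('INSERT INTO login_attempts (username, attempted_at) VALUES (?, NOW())')
        ->execute(array($username));
}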
3) It depends on what type of content it is and how much there is. If you have a lot of different variables, you could keep only the key fields that identify the data in the database and keep all the additional data in a cache, so that database queries run faster: you quickly find the rows you want, then just open the cache file associated with them.
4) It's up to you; it really depends on traffic. If you're only getting 2 or 3 new pieces of data per day, you probably don't want to waste the time and space backing it up every day. P.S. MySQL exports work just fine; I find them much easier to import and work with.
1) You will want to keep taking your site down for updates to a minimum. I tend to let jobs build up, and then do a big update at the end of the month.
2) In terms of limiting login attempts: cookies would be simple to implement but are not fool-proof; they will stop the majority of your users but can be easily circumvented, so it would be best to choose another way. Using the database would be better, but is a bit more complicated to implement and could add more strain to the database.
3) Caching depends greatly on how often content is updated or changes. If content changes a lot, it may not be worth caching the data, but if it's mostly static then something like memcache or APC may be of use.
4) You should always make regular backups. I do one daily via a cron job to my home server, although a weekly one would suffice.
Side notes: YSlow indicates that:
you are not serving expires headers on your CSS or images (this causes pages to load more slowly, and costs you more bandwidth)
you have CSS files that are not served with gzip compression (same issues)
also consider moving your static content (CSS, images, etc.) to a separate domain (CDN) for faster load times
