Memcache to deal with high latency web services APIs - good idea? - php

I have a PHP application that calls web services APIs to get some objects before rendering a web page that incorporates those objects. In some cases these APIs are really slow (seconds) and that is not acceptable from a user experience point of view. Two things I know I can do...
Use ajax and make these calls in the background
Time out the call and degrade gracefully if it is taking too long
Neither is ideal, so I was thinking about using memcache (the PHP extension for memcached) to cache the object that I get from the 3rd party web service. The objects will be loaded many times by different users loading the same page, so this seems to make sense.
The objects are relatively small (~1k).
Does this sound like a reasonable approach? I know memcached was originally designed to alleviate database load, so I'm wondering whether there is a gotcha somewhere that I'm not seeing.
Thanks.

This is a perfectly legitimate use of memcache. It is not only for database load reduction, it is for caching and object storage in general. :)
Also note, PHP has two interfaces for memcached. Confusingly, they are named "memcache" and "memcached". Read these to pick between the two:
https://serverfault.com/questions/63383/memcache-vs-memcached
http://code.google.com/p/memcached/wiki/PHPClientComparison

I'd highly recommend memcache for this situation as it will:
Reduce DNS calls.
Reduce page latency.
Reduce bandwidth usage.
Your only real task is to determine how often the data you are dealing with will be changing. This will help you to optimize your expiry time for the cache key(s).
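For illustration, here is a minimal cache-aside sketch using the php-memcached extension; the server address, key name, 5-minute TTL and the fetch_widget_from_api() helper are assumptions to adapt to how often your data actually changes:

    <?php
    // Minimal cache-aside sketch with the php-memcached extension.
    // Tune the 300-second TTL to how often the upstream data changes.
    $mc = new Memcached();
    $mc->addServer('127.0.0.1', 11211);

    $key    = 'api:widget:42';          // hypothetical cache key
    $object = $mc->get($key);

    if ($object === false && $mc->getResultCode() === Memcached::RES_NOTFOUND) {
        $object = fetch_widget_from_api(42);   // placeholder: your existing slow API call
        $mc->set($key, $object, 300);          // expire after 5 minutes
    }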

This approach may not work in your situation, but you could use a cron job to call a PHP script that loads the required information and then caches it in a faster data source (an XML file or a database).
This may not work if the information is updated really often or if there is a lot of different data that needs to be loaded, but it is an option. I've used this approach for other tasks that take a lot of time to complete and have found it to be a reasonable solution.

Related

Linux server: Would a cache scheme help reduce hits to 3rd-party server?

I have a situation where my Linux server will be running a website which gets some of its data from a 3rd-party server through a SOAP interface. The data isn't exactly real-time, but it does change every 5 minutes or so. I was told not to have our website hammer their website for data, which I can completely understand.
So I wondered if this was a good candidate for a cache scheme of some type, where, when a user comes to our web page to display the data, if it's less than 5 minutes old (for example), it would get that data from our server instead of polling the 3rd-party for it. This way, if 100 users at once come to our website, our server won't be accessing the 3rd-party website 100 times to fetch the exact same data within a given time frame.
Is this a practical thing to do in PHP? Or should this be written in a faster language when it comes to caching? Are there cache packages for this sort of situation which can be used along with a PHP Joomla application? Thanks!
I think memcached is a good choice.
You can set a timeout (TTL) when you store content on the memcached server; on a cache miss, retrieve the data from the 3rd-party server and store it again.
There is a memcached extension for PHP; check the doc here.
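To make that concrete, a rough sketch of the check-then-fetch flow around a SOAP call; the WSDL URL, method name, parameters and 5-minute TTL are placeholders, not a reference to your actual service:

    <?php
    // Check memcached first; only hit the 3rd-party SOAP service on a miss.
    $mc = new Memcached();
    $mc->addServer('127.0.0.1', 11211);

    $params = array('region' => 'us');                      // illustrative parameters
    $key    = 'soap:getQuotes:' . md5(serialize($params));
    $data   = $mc->get($key);

    if ($data === false) {
        $client = new SoapClient('https://example.com/service?wsdl');
        $data   = $client->getQuotes($params);
        $mc->set($key, $data, 300);   // the data changes roughly every 5 minutes
    }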
There are lots of ways to solve the problem - we can't say which is the right one without knowing a lot more about the constraints you are working within or how the service is used. If you are using Joomla then you're obviously not bothered about performance - it would be really hard to write anything which has a measurable impact on your HTML generation times. This does not need to "be written in a faster language", but....
can you install additional software?
have you got access to cron?
at what rate is the service consumed?
how many webservers do you have consuming the service - do they have a shared filesystem? Are they on the same sub-net?
Is the SOAP response cacheable?
how do you deal with non-availability of the service?
For a very scalable solution I would suggest running a simple forward proxy (e.g. squid) but do make sure that it's not accessible from the internet. Sven (see comment elsewhere) is right about POST sometimes not being cacheable - but you can cache the response from a surrogate script on your own site accessed via GET returning appropriate caching instructions - and this could return the data as a serialized php array / object which is much less expensive to process. Indeed whichever method you choose I would recommend caching the parsed response - not the XML. This also allows you to override poor caching information from the service.
If the rate is less than around 1 per minute then the cron solution is overkill. But if it's more than 20 per minute then it makes a lot of sense. If you don't have access to cron / can't install your own software then you might consider simply caching the response and refreshing the cache on demand. Don't bother with memcache unless you are already using it. APC is faster on a single server - but memcache is distributed. If you have multiple servers then use whatever cluster storage you are currently sharing your data in (distributed filesystem / database cluster / shared filesystem....).
Don't try to use locking / mutexes around the cache refresh unless you really have to (i.e. only if accessing the service more than once every 5 minutes is a mortal sin) - this gets real complicated real quick - it's too easy to introduce bugs.
Do make sure you buffer and validate any responses before writing them to the cache.
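As a sketch of that last point (buffer the whole response, validate it, then publish it atomically so readers never see a half-written cache file); the URL, path and validation check here are illustrative assumptions:

    <?php
    // Fetch the whole response, sanity-check it, then swap it into place.
    $cacheFile = '/var/cache/myapp/service-response.ser';

    $raw = file_get_contents('https://example.com/surrogate.php'); // GET surrogate script
    if ($raw === false) {
        return; // service unavailable - keep serving the existing (stale) cache
    }

    $parsed = @unserialize($raw);
    if ($parsed === false || !is_array($parsed)) {
        return; // failed validation - do not overwrite the cache
    }

    $tmp = tempnam(dirname($cacheFile), 'cache');
    file_put_contents($tmp, serialize($parsed));
    rename($tmp, $cacheFile); // atomic on the same filesystem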
Yes, just use HTTP. Most of the heavy lifting has already been built into your web server.
Since SOAP is just a simple HTTP POST request with an XML body, you could set up your website or HTTP API in front of the SOAP endpoint to act as a translator to regular HTTP, attach the appropriate HTTP caching headers to the transformed response body, and then configure an Nginx reverse proxy in front of it.
Notably: if the transformation is simple you could just use XSLT to transform the response body from the SOAP API and remove the web service layer entirely.
Your problem is a very small one, which does not require a complicated solution.
You could write a small cron job that is executed every five minutes, sends the request to the SOAP server, and stores the result in a local file. If any script needs the data, it reads the local file. This will result in 288 requests to the SOAP server per day, and have excellent performance for any script call that needs the results because they are already on your server.
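A minimal sketch of that idea, assuming a hypothetical WSDL URL, method name and cache path; this script would run from cron every five minutes:

    <?php
    // refresh-cache.php - run from cron, e.g. */5 * * * *
    $client = new SoapClient('https://example.com/service?wsdl');
    $result = $client->getData();                             // the slow SOAP call
    file_put_contents('/var/www/cache/soap-data.ser', serialize($result));

The page scripts then simply call unserialize(file_get_contents()) on that file instead of talking to the SOAP service at all.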
If you do not have cron jobs available and cannot fake them, any other cache will do. You really don't need fancy stuff like Memcached, unless it already is available. Storing the result to a cache file will work as well. Note that if you have to really fetch the SOAP result from the origin, this will take some more time and might affect the perceived performance of your site.
There are plenty of frameworks which also offer cache support, and if you use one you should investigate if there is support included. I'm not sure if Joomla has something appropriate for you. Otherwise, you can implement something yourself. It isn't that hard.
Cache functionality comes in various flavours:
memory-based, where a separate process on the server holds data in RAM (or overflows to disk) and you query it like you would a database; very efficient and powerful, with options to manage storage use and clean up after itself, but it requires setting up additional software on the server; e.g. memcached, redis
file-based, where you just write the data to disk; less efficient, but can be implemented in "user-land" code, i.e. pure PHP; beware of filling up your disk with variant caches that have expired but not been cleaned up; many frameworks have an implementation of this built in
database-backed, where you push data into an RDBMS (e.g. MySQL, PostgreSQL) or fully-featured NoSQL store (e.g. MongoDB); might make sense if you have a large amount of data, and can trade a bit of performance; as with files, you need to make sure that stale data is cleaned up
In each case, the basic idea is that you create a "key" that can tell one request from another (e.g. the name of the SOAP call and its input parameters, serialized), and pick a "lifetime" (how long you want to carry on using the same copy of the data). The caching engine or library then checks for a cache with that key, and if it is still within its "lifetime" returns the previously cached data. If there is a "cache miss" (there is no cache for that key, or it has expired), you perform the costly operation (in your case, the SOAP call) and save to the cache, using the same key.
You can do more complex things, like pre-caching things in the background so that there is never a cache miss, or having some code paths which accept stale data in order to return quickly, but these can generally be implemented on top of whatever you're using as the main caching solution.
Edit: Another important decision is at what level of granularity to cache the data, in relation to processing it. At one extreme, you could cache each individual SOAP call: simple to set up, but it means re-processing the same data repeatedly, and it can cause problems if two responses are related but cached independently and get out of sync. At the other extreme, you can cache whole rendered pages: pages load very fast once cached, but creating variations based on the same data without repeating work becomes tricky.
In between are various points in your code where you have processed and combined data into meaningful chunks: if your application is well written, these are the inputs and outputs of major functions, or possibly even complete model objects. This is more work to implement, as you have to choose the right keys (avoiding two contexts overwriting each other's caches, while ignoring variables that have no impact on the data in question) and the right values (avoiding repeats of costly work without having to store huge blobs of data that are slow to unserialize and use up the capacity of your cache store). As with anything else, no approach suits all needs, and a complex application will probably involve caching at multiple levels for different purposes.

Redis with PHP - implementing data caching

I have installed redis on my server and implemented object caching for data returned within a PHP based web application. The php model essentially executes a reasonably complex query and returns a detailed array of data. I have tested the caching and found everything to be working as expected. I first check to see if the key exists in redis. If it does, redis returns the data, the model unserializes and returns the previously cached data. If the cache has expired, the model executes the sql query, returns the data and sets the key and serialized value in redis.
So here are my questions.
I'm not sure how to really benchmark this as it is all browser based. What tools are out there that would allow me to get a reasonable benchmark to compare caching versus not caching? I'm thinking of perhaps a PHP script that calls the API 1000 times via curl.
I implemented this in redis because I once read that caching with redis will work across multiple sessions or IP addresses accessing the site. For example, if the API is accessed 1000 times an hour by multiple IP addresses/users, I am assuming this approach will reduce the load on the MySQL server and let redis do the work of returning the cached data instead. Can anyone shed some light on this? Are my assumptions valid?
All comments are welcome!
Thanks!
Dave
To benchmark the web site, I would use something like Siege rather than writing a specific PHP script.
Regarding Redis usage, caching things in in-memory stores like memcached or Redis is now very common. Both memcached and Redis are suitable for this purpose, although for pure caching, memcached is arguably easier to set up. 1000 times an hour represents only about 0.3 TPS - any data store (including MySQL) will support such traffic without any issue. Now, multiply this traffic by 100 or 1000, and the caching layer (memcached or Redis) will become mandatory to protect your database.
To use Redis for caching, you may want to check the EXPIRE command and have a look at the maxmemory-policy parameter in the configuration file.
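For reference, a rough sketch with the phpredis extension using SETEX so Redis expires the entry on its own; the key, 10-minute TTL and run_complex_query() are made-up placeholders:

    <?php
    // Serve from Redis when possible; rebuild and cache with a TTL otherwise.
    $redis = new Redis();
    $redis->connect('127.0.0.1', 6379);

    $key    = 'report:monthly';                // hypothetical cache key
    $cached = $redis->get($key);

    if ($cached !== false) {
        $rows = unserialize($cached);
    } else {
        $rows = run_complex_query();                  // placeholder: your existing MySQL work
        $redis->setex($key, 600, serialize($rows));   // expire in 10 minutes
    }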
I have done extensive testing of cache backends for the Zend_Cache library. The tests were done using multiple php-cli processes and randomized data and considered read performance, write performance and cache tag cleaning performance. If testing just the cache backend the web server performance is not relevant so I recommend testing via CLI to simplify the testing. Also, testing with only one process will not give you an accurate picture of a backend's characteristics under heavy load.
MySQL is very fast itself and if you are doing single-record indexed queries then MySQL's own query cache will be very fast. I'd only recommend adding an additional caching layer for things that are slow (aggregated results of multiple queries or generating chunks of HTML). You can use Zend_Cache without including the entire Zend framework so I highly recommend you check out both Cm_Cache_Backend_Redis and Cm_Cache_Backend_File.
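If you do go the Zend_Cache route, usage looks roughly like the sketch below (ZF1 API); the lifetime, cache_dir and build_article_listing() are placeholders, and the File backend can be swapped for Cm_Cache_Backend_Redis with the matching backend options:

    <?php
    require_once 'Zend/Cache.php';

    // Stand-alone Zend_Cache with the File backend.
    $cache = Zend_Cache::factory(
        'Core',
        'File',
        array('lifetime' => 300, 'automatic_serialization' => true),
        array('cache_dir' => '/tmp/app-cache/')
    );

    if (($html = $cache->load('article_listing')) === false) {
        $html = build_article_listing();   // placeholder: slow aggregation / HTML chunk
        $cache->save($html, 'article_listing');
    }
    echo $html;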

Speedup a php web site

What's the best way (or ways) to speed up a PHP web site, and how much faster can it get using one approach or another?
PHP isn't really the kind of language where you can do micro-optimizations, or just work on the code alone. There's really no point. Although PHP isn't particularly fast, PHP itself is rarely the bottleneck in a given web site.
You need to work out where that bottleneck is before you can fix it. There are a lot of common bottlenecks, with common solutions. It's difficult to generalize, given so few details, but there are a lot of performance hints that apply to most web sites.
The first good place to look is actually on the client side, rather than the server side. How large are your pages (including images, CSS, JavaScript and the like)? How many HTTP requests does a single page view require? Use something like Firebug (and the YSlow add-on for Firebug) to see how long your page actually takes to load, and which bits of your page cause the problem. Some general hints:
Work out ways to shrink the CSS and JavaScript - remove anything you don't need, and run the rest through a tool like YUI Compressor.
If you have multiple CSS and JavaScript files, try to combine them into a single file.
Optimize all of your images as much as possible, and see if you can combine any of those into a single file using CSS sprites or similar. PunyPNG is good for lossless images. A decent JPEG encoder (NOT Photoshop) is good for photos.
Move the CSS to the top of the page, and the JavaScript to the bottom, so the browser can render the page before the JavaScript has finished downloading.
Make sure that all of your CSS, JavaScript and HTML are being served compressed.
Make sure that you're using appropriate caching - if a file hasn't changed, there's no point in re-downloading it.
Once you've got the client side out of the way, you might have to turn your attention to the server side.
Install an opcode cache, like APC, XCache, or Zend Optimizer. It's very easy to do, and will always provide some improvement. Once you've done that, profile your pages, to find out where the time is actually being spent.
More likely than not, you'll be spending most of your time waiting for the database to return results. So, at a bare minimum:
Work out which queries are taking the longest, and work on them first. Use your head though - a query that takes five seconds on an admin page that nobody looks at is not as important as a query that takes one second on the front page.
Make sure that your query uses appropriate indexes. No common query should ever need to do a full table scan. Certain kinds of sorting or grouping may be unable to use indexes - try to avoid them, or modify the query so that it can use indexes.
Make sure that your queries aren't using temporary tables.
Use the EXPLAIN keyword - it's very useful (there's a quick sketch of running it from PHP after this list).
Tune the database server itself. MySQL's default configuration is generally not tuned for performance.
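As a quick illustration of running EXPLAIN from PHP during development (the DSN, credentials and the query itself are hypothetical):

    <?php
    $pdo  = new PDO('mysql:host=localhost;dbname=app', 'user', 'secret');
    $plan = $pdo->query(
        'EXPLAIN SELECT * FROM articles WHERE category_id = 7 ORDER BY created_at DESC'
    )->fetchAll(PDO::FETCH_ASSOC);

    print_r($plan);   // watch for type=ALL (full table scan) or "Using temporary"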
Once you've done that, it's usually best to start working out how to use caching. The best way to speed PHP code up is to reduce the amount of work it has to do.
Make sure your database's query cache is working properly.
Use something like Memcached to store frequently used results, instead of getting them from the database.
If you have enough memory, try to keep everything in Memcached, resorting to the database only when something isn't present in the cache.
If you have chunks of pages that are dynamic, but the same for all users, try caching those chunks. For example, if two users are looking at an article, the article itself is going to be exactly the same for each user, even if the rest of the page isn't. Generate the HTML for the article, and chuck it in the cache.
If you have lots of non-authenticated users, it's entirely possible that they'll all be seeing the exact same page. Two non-authenticated users looking at the above article won't just see an identical article - they'll see an identical page, right down to the login links. Set your PHP scripts up so you can use HTTP caching headers (check the last modified date, and return a 304 Not Modified if it's not been changed) - there's a small sketch of this below. Once you've done that, stick a Squid reverse-proxy in front of the webserver, and let Squid serve pages out of its cache.
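A bare-bones sketch of that 304 logic, assuming a hypothetical helper that returns the article's last-modified timestamp:

    <?php
    $lastModified = get_article_last_modified($articleId);   // hypothetical helper

    header('Last-Modified: ' . gmdate('D, d M Y H:i:s', $lastModified) . ' GMT');
    header('Cache-Control: public, max-age=300');

    if (isset($_SERVER['HTTP_IF_MODIFIED_SINCE']) &&
        strtotime($_SERVER['HTTP_IF_MODIFIED_SINCE']) >= $lastModified) {
        header('HTTP/1.1 304 Not Modified');
        exit;
    }
    // ...otherwise render the full page as usual.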
After that point, the general approach is to start using more servers, and the problem becomes one of scaling, rather than raw speed. The general plan is to make sure that your website has a shared-nothing architecture - all persistent data is stored in the database. Then, you install multiple webservers, move the database server to a separate machine, and run the entire thing behind a caching reverse proxy. To add more capacity, you add more machines.
One way: php accelerators, e.g. APC.
Another: read blog articles, e.g. a performance tuning overview.
A general question, I would say. Try looking for optimization tips online...
Several parameters are involved:
I/O access (heavy use of file_exists, is_file and similar checks adds overhead)
Database access (optimize queries, use stored procedures, check your db cache)
Using an opcode cache (like APC)
Compressing output
Serving js/css minified and compressed (and using subdomains to deliver them to the browser)
Using memcache to cache data into memory for faster access
You can use benchmarking tools to test your environment before and after the optimizations.
Try Apache Bench (ab), for example.
Filesize.
A file of 500 KB takes longer to download than a file of 300 KB. So optimize and crop as much as you can.
Accelerators
Self-explanatory: List of PHP accelerators
Server upgrade
Though this costs money, when dealing with a lot of traffic it will have an impact on how fast the .php files get processed and how fast data is sent to the user.
I don't recommend this though since there are other (free) ways to improve speed.
Don't use external resources
If you are linking images from other sites, the download speed will not be in your control. Instead, if you plan on using images from others, download them to your own server first (or upload them to your own provider) and load them that way.
Review and improve your code
Find shortcuts, remove unnecessary code, delete unused variables, reuse others, etc.
There are other ways, but I believe the above information has the most impact on your speed.
You should probably do some search for existing answers to this question, however...
APC for opcode caching
Memcached for object storing (to reduce the number of database queries)
Check for / optimize slow SQL queries
Measure and find bottlenecks
Don't rely on (slow) web services on each page load, etc.
Yahoo has got some good basic advice on speeding up web pages, much of it very easy to implement. You may also want to download YSlow + Firebug for Firefox; they will help indicate possible basic bottlenecks from a client request perspective.
The rest of the advice here is good, so I won't add much else other than this: don't bother optimising any code until you're 100% sure that you've found a bottleneck. I can't stress that enough. Don't waste time tweaking code or implementing new things (i.e. caching) because you "feel" they will make things quicker; act only on real evidence (i.e. performance profiling).

File access speed vs database access speed

The site I am developing in PHP makes many MySQL database requests per page viewed, albeit many are small requests against properly designed indexes. I do not know if it will be worthwhile to develop a caching script for these pages.
Is file I/O generally faster than database requests? Does this depend on the server? Is there a way to test how many of each your server can handle?
One of the pages checks the database for a filename, then checks the server to see if it exists, then decides what to display. This I would assume would benefit from a cached page view?
Also if there is any other information on this topic that you could forward me to that would be greatly appreciated.
If you're doing read-heavy access (looking up filenames, etc) you might benefit from memcached. You could store the "hottest" (most recently created, recently used, depending on your app) data in memory, then only query the DB (and possibly files) when the cache misses. Memory access is far, far faster than database or files.
If you need write-heavy access, a database is the way to go. If you're using MySQL, use InnoDB tables, or another engine that supports row-level locking. That will avoid people blocking while someone else writes (or worse, writing anyway).
But ultimately, it depends on the data.
It depends on how the data is structured, how much there is and how often it changes.
If you've got relatively small amounts, of relatively static data with relatively simple relationships - then flat files are the right tool for the job.
Relational databases come into their own when the connections between the data are more complex. For basic 'look up tables' they can be a bit overkill.
But, if the data is constantly changing, then it can be easier to just use a database rather than handle the configuration management by hand - and for large amounts of data, with flat files you've got the additional problem of how do you find the one bit that you need, efficiently.
This really depends on many factors. If you have a fast database with much of its data cached in RAM, or a fast RAID system, chances are slim that you will gain much from simple file system caching on the web server. Also think about scalability: under high workload a simple caching mechanism might easily become a bottleneck, while a database is designed to handle heavy workloads.
If there are not that many requests and you (or the operating system) can keep the cache in RAM, you might be able to gain some performance. But then the question arises whether it is really necessary to perform caching under a low workload.
From a pure performance perspective, it is wiser to tune the database server and not complicate the data access logic with intermediate file caches. A good database server will do the caching on its own if the results are cacheable. (I'm not sure what the case is with MySQL.)
If you have performance problems, you should profile the pages to see the real bottlenecks. Even when you are - like me - a fan of optimized code, putting stronger or more hardware into the equation is cheaper in the long run.
If you still need to use caches, consider using an existing solution, like memcached.

How to get Google like speeds with php?

I am using PHP with the Zend Framework, and database connects alone seem to take longer than the 0.02 seconds Google takes to do a query. The weird thing is, today I watched a video that said Google connects to 1000 servers for a single query. Given latency, I would expect one server per query to be more efficient than having multiple servers in different datacenters handling things.
How do I get PHP, MySQL and the Zend Framework to work together and reach equal great speeds?
Is caching the only way? How do you optimize your code to take less time to "render"?
There are many techniques that Google uses to achieve the amount of throughput it delivers. MapReduce, Google File System, BigTable are a few of those.
There are a few very good Free & Open Source alternatives to these, namely Apache Hadoop, Apache HBase and Hypertable. Yahoo! is using and promoting the Hadoop projects quite a lot and thus they are quite actively maintained.
"I am using PHP with the Zend Framework and database connects alone seem to take longer than the 0.02 seconds Google takes to do a query."
Database connect operations are heavyweight no matter who you are: use a connection pool so that you don't have to initialise resources for every request.
Performance is about architecture not language.
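PHP has no built-in connection pool of its own, but persistent connections give a similar effect: the connect handshake is skipped when a request is served by a worker that already holds an open connection. A minimal sketch with PDO (DSN and credentials are placeholders):

    <?php
    $pdo = new PDO(
        'mysql:host=127.0.0.1;dbname=app',
        'user',
        'secret',
        array(PDO::ATTR_PERSISTENT => true)   // reuse the connection across requests
    );

Persistent connections have their own caveats (open transactions and locks can leak between requests), so treat this as a trade-off rather than a free win.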
A while ago Google decided to put everything into RAM.
http://googlesystem.blogspot.com/2009/02/machines-search-results-google-query.html
If you never have to query the hard drive, your results will improve significantly. Caching helps because you don't query the hard drive as much, but you still do when there is a cache miss (Unless you mean caching with PHP, which means you only compile the PHP program when the source has been modified).
It really depends on what you are trying to do, but here are some examples:
Analyze your queries with explain. In your dev environment you can output your queries and execution time to the bottom of the page - reduce the number of queries and/or optimize those that are slow.
Use a caching layer. Looks like Zend can be memcache enabled. This can potentially greatly speed up your application by sending requests to the ultra-fast caching layer instead of the db.
Look at your front-end loading time. Use Yahoo's YSlow add-on to Firebug. Limit http requests, set far-future headers to cache js, css and images. Etc.
You can get lightning speeds on your web app, probably not as fast as google, if you optimize each layer of your application. Your db connect times are probably not the slowest part of your app.
Memcached is a recommended solution for optimizing storage/retrieval in memory on Linux.
PHP scripts by default are interpreted every time they are called by the http server, so every call initiates script parsing and probably compilation by the Zend Engine.
You can get rid of this bottleneck by using script caching, like APC. It keeps the once compiled PHP script in memory/on disk and uses it for all subsequent requests. Gains are often significant, especially in PHP apps created with sophisticated frameworks like ZF.
Every request by default opens up a connection to the database, so you should use some kind of database connection pooling or persistent connections (which don't always work, depending on http server/php configuration). I have never tried, but maybe there's a way to use memcache to keep database connection handles.
You could also use memcache for keeping session data, if they're used on every request. Their persistence is not that important and memcache helps make it very fast.
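If the php-memcached extension is installed with session support, pointing the session handler at memcached looks roughly like this (normally configured in php.ini; the host and port are placeholders):

    <?php
    // Must run before session_start().
    ini_set('session.save_handler', 'memcached');
    ini_set('session.save_path', '127.0.0.1:11211');
    session_start();

    $_SESSION['user_id'] = 42;   // now stored in memcached instead of local files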
The actual "problem" is that PHP works a bit different than other frameworks, because it works in a SSI (server-side includes) way - every request is handled by http server and if it requires running a PHP script, its interpreter is initialized and scripts loaded, parsed, compiled and run. This can be compared to getting into the car, starting the engine and going for 10 meters.
The other way is, let's say, an application-server way, in which the web application itself is handling the requests in its own loop, always sharing database connections and not initializing the runtime over and over. This solution gives much lower latency. This on the other hand can be compared to already being in a running car and using it to drive the same 10 meters. ;)
The above caching/precompiling and pooling solutions are the best in reducing the init overhead. PHP/MySQL is still a RDBMS-based solution though, and there's a good reason why BigTable is, well, just a big, sharded, massively distributed hashtable (a bit of oversimplification, I know) - read up on High Scalability.
If it's for a search engine, the bottleneck is the database, depending on its size.
In order to speed up full-text search on a large set, you can use Sphinx. It can be configured on one or multiple servers. However, you will have to adapt your existing querying code, as Sphinx runs as a search daemon (libraries are available for most languages).
Google have a massive, highly distributed system that incorporates a lot of proprietary technology (including their own hardware, and operating, file and database systems).
The question is like asking: "How can I make my car be a truck?" and essentially meaningless.
According to the link supplied by #Coltin, Google response times are in the region of 0.2 seconds, not 0.02 seconds. As long as your application has an efficient design, I believe you should be able to achieve that on a lot of platforms. Although I do not know PHP, it would surprise me if 0.2 seconds were a problem.
APC code caching;
Zend_Cache with APC or Memcache backend;
CDN for the static files;
