How to get Google like speeds with php?


I am using PHP with the Zend Framework, and database connects alone seem to take longer than the 0,02 seconds Google takes to do a query. The weird thing is that today I watched a video that said Google connects to 1000 servers for a single query. With latency, I would expect one server per query to be more efficient than having multiple servers in different datacenters handling the work.
How do I get PHP, MySQL and the Zend Framework to work together and reach equally great speeds?
Is caching the only way? How do you optimize your code to take less time to "render"?

There are many techniques that Google uses to achieve the amount of throughput it delivers. MapReduce, Google File System, BigTable are a few of those.
There are a few very good Free & Open Source alternatives to these, namely Apache Hadoop, Apache HBase and Hypertable. Yahoo! is using and promoting the Hadoop projects quite a lot and thus they are quite actively maintained.

I am using PHP with the Zend Framework
and Database connects alone seem to
take longer than the 0,02 seconds
Google takes to do a query.
Database connect operations are heavyweight no matter who you are: use a connection pool so that you don't have to initialise resources for every request.
Performance is about architecture not language.
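
To make the pooling advice concrete, here is a minimal sketch of the idea. It is written in Python rather than PHP purely to keep it short and runnable anywhere, and sqlite3 stands in for MySQL; the ConnectionPool class and its names are hypothetical, not part of any framework mentioned here.

```python
import queue
import sqlite3
from contextlib import contextmanager


class ConnectionPool:
    """Open a fixed number of connections once and lend them out per request."""

    def __init__(self, connect, size=2):
        self._idle = queue.Queue(maxsize=size)
        self.opened = 0  # how many real connections were ever created
        for _ in range(size):
            self._idle.put(connect())
            self.opened += 1

    @contextmanager
    def connection(self, timeout=1.0):
        conn = self._idle.get(timeout=timeout)  # waits if every connection is busy
        try:
            yield conn
        finally:
            self._idle.put(conn)  # return it to the pool instead of closing it


# Assumption: sqlite3 is only a stand-in so the sketch needs no database server.
pool = ConnectionPool(lambda: sqlite3.connect(":memory:"), size=2)


def handle_request():
    with pool.connection() as conn:
        return conn.execute("SELECT 1 + 1").fetchone()[0]


results = [handle_request() for _ in range(5)]
```

Five requests are served here while only two connections are ever opened, which is the whole point: the connect cost is paid at startup, not per request.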

A while ago Google decided to put everything into RAM.
http://googlesystem.blogspot.com/2009/02/machines-search-results-google-query.html
If you never have to query the hard drive, your results will improve significantly. Caching helps because you don't query the hard drive as much, but you still do when there is a cache miss (Unless you mean caching with PHP, which means you only compile the PHP program when the source has been modified).

It really depends on what you are trying to do, but here are some examples:
Analyze your queries with explain. In your dev environment you can output your queries and execution time to the bottom of the page - reduce the number of queries and/or optimize those that are slow.
Use a caching layer. Looks like Zend can be memcache enabled. This can potentially greatly speed up your application by sending requests to the ultra-fast caching layer instead of the db.
Look at your front-end loading time. Use Yahoo's YSlow add-on to Firebug. Limit http requests, set far-future headers to cache js, css and images. Etc.
You can get lightning speeds on your web app, though probably not as fast as Google, if you optimize each layer of your application. Your db connect times are probably not the slowest part of your app.
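
As a rough illustration of the first point (logging each query with its execution time in a dev environment), here is a small sketch. It is in Python with sqlite3 as a stand-in for MySQL, and the TimedConnection wrapper is a made-up name; in MySQL you would run EXPLAIN on the slow statements, whereas SQLite's equivalent is EXPLAIN QUERY PLAN.

```python
import sqlite3
import time


class TimedConnection:
    """Wrap a DB connection and record how long every statement takes."""

    def __init__(self, conn):
        self._conn = conn
        self.log = []  # one (sql, seconds) tuple per executed statement

    def execute(self, sql, params=()):
        start = time.perf_counter()
        rows = self._conn.execute(sql, params).fetchall()
        self.log.append((sql, time.perf_counter() - start))
        return rows

    def report(self):
        """Lines suitable for printing at the bottom of a dev page, slowest first."""
        slowest_first = sorted(self.log, key=lambda entry: entry[1], reverse=True)
        return [f"{secs * 1000:.2f} ms  {sql}" for sql, secs in slowest_first]


db = TimedConnection(sqlite3.connect(":memory:"))
db.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT)")
db.execute("INSERT INTO posts (title) VALUES (?)", ("hello",))
rows = db.execute("SELECT title FROM posts WHERE id = ?", (1,))
# Ask the database how it would run the query (SQLite syntax, not MySQL's).
plan = db.execute("EXPLAIN QUERY PLAN SELECT title FROM posts WHERE id = ?", (1,))
```

The report makes both problems visible at a glance: too many queries per page, and individual queries that dominate the total time.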

Memcached is a recommended solution for optimizing storage/retrieval in memory on Linux.

PHP scripts by default are interpreted every time they are called by the http server, so every call initiates script parsing and probably compilation by the Zend Engine.
You can get rid of this bottleneck by using script caching, like APC. It keeps the once compiled PHP script in memory/on disk and uses it for all subsequent requests. Gains are often significant, especially in PHP apps created with sophisticated frameworks like ZF.
Every request by default opens up a connection to the database, so you should use some kind of database connection pooling or persistent connections (which don't always work, depending on http server/php configuration). I have never tried, but maybe there's a way to use memcache to keep database connection handles.
You could also use memcache for keeping session data, if they're used on every request. Their persistence is not that important and memcache helps make it very fast.
The actual "problem" is that PHP works a bit differently from other frameworks, because it works in an SSI (server-side includes) way: every request is handled by the http server, and if it requires running a PHP script, the interpreter is initialized and the scripts are loaded, parsed, compiled and run. This can be compared to getting into the car, starting the engine and driving 10 meters.
The other way is, let's say, an application-server way, in which the web application itself is handling the requests in its own loop, always sharing database connections and not initializing the runtime over and over. This solution gives much lower latency. This on the other hand can be compared to already being in a running car and using it to drive the same 10 meters. ;)
The above caching/precompiling and pooling solutions are the best at reducing the init overhead. PHP/MySQL is still an RDBMS-based solution though, and there's a good reason why BigTable is, well, just a big, sharded, massively distributed hashtable (a bit of oversimplification, I know) - read up on High Scalability.

If it's for a search engine, the bottleneck is the database, depending on its size.
To speed up full-text search on a large data set, you can use Sphinx. It can be configured on one or multiple servers. However, you will have to adapt your existing querying code, as Sphinx runs as a search daemon (libs are available for most languages).

Google have a massive, highly distributed system that incorporates a lot of proprietary technology (including their own hardware, and operating, file and database systems).
The question is like asking: "How can I make my car be a truck?" and essentially meaningless.

According to the link supplied by @Coltin, Google response times are in the region of .2 seconds, not .02 seconds. As long as your application has an efficient design, I believe you should be able to achieve that on a lot of platforms. Although I do not know PHP, it would surprise me if .2 seconds were a problem.

APC code caching;
Zend_Cache with APC or Memcache backend;
CDN for the static files;

Related

Linux server: Would a cache scheme help reduce hits to 3rd-party server?

I have a situation where my Linux server will be running a website which gets some of its data from a 3rd-party server through a SOAP interface. The data isn't exactly real-time, but it does change every 5 minutes or so. I was told not to have our website hammer their website for data, which I can completely understand.
So I wondered if this was a good candidate for a cache scheme of some type: when a user comes to our web page to display the data, if the data is less than 5 minutes old (for example), it would come from our server instead of polling the 3rd party for it. This way, if 100 users come to our website at once, our server won't access the 3rd-party website 100 times to fetch the exact same data within a given time frame.
Is this a practical thing to do in PHP? Or should this be written in a faster language when it comes to caching? Are there cache packages for this sort of situation which can be used along with a PHP Joomla application? Thanks!

I think memcached is a good choice.
You can set a timeout when you store content in the memcached server; if the key is missing or has expired, retrieve the data from the 3rd-party server and store it again.
There is a memcached extension for PHP; see its documentation.
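
The store-with-timeout pattern described above looks roughly like this. The sketch is in Python with an in-process dictionary standing in for the memcached server, and TTLCache, cached_fetch and fetch_from_third_party are illustrative names only; with memcached the get and set calls would go to the server instead.

```python
import time


class TTLCache:
    """Tiny in-process stand-in for a memcached-style store with expiry."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._items = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._items.get(key)
        if entry is None or entry[0] <= self._clock():
            return None  # missing or expired counts as a cache miss
        return entry[1]

    def set(self, key, value, ttl):
        self._items[key] = (self._clock() + ttl, value)


def cached_fetch(cache, key, ttl, fetch):
    """Return the cached value, calling fetch() only on a miss."""
    value = cache.get(key)
    if value is None:
        value = fetch()
        cache.set(key, value, ttl)
    return value


# Demo with a fake clock so the 5-minute expiry can be shown without waiting.
now = [0.0]
calls = []


def fetch_from_third_party():  # hypothetical stand-in for the SOAP call
    calls.append(now[0])
    return {"fetched_at": now[0]}


cache = TTLCache(clock=lambda: now[0])
first = cached_fetch(cache, "feed", 300, fetch_from_third_party)
now[0] = 120.0  # two minutes later: still served from the cache
second = cached_fetch(cache, "feed", 300, fetch_from_third_party)
now[0] = 301.0  # just past five minutes: the entry has expired
third = cached_fetch(cache, "feed", 300, fetch_from_third_party)
```

However many visitors arrive within the five-minute window, the 3rd-party server is contacted once per window.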

There are lots of ways to solve the problem - we can't say which is the right one without knowing a lot more about the constraints you are working in or how the service is used. If you are using Joomla then you're obviously not bothered about performance - it would be really hard to write anything which has a measurable impact on your html generation times. This does not need to "be written in a faster language", but....
can you install additional software?
have you got access to cron?
at what rate is the service consumed?
how many webservers do you have consuming the service - do they have a shared filesystem? Are they on the same sub-net?
Is the SOAP response cacheable?
how do you deal with non-availability of the service?
For a very scalable solution I would suggest running a simple forward proxy (e.g. squid) but do make sure that it's not accessible from the internet. Sven (see comment elsewhere) is right about POST sometimes not being cacheable - but you can cache the response from a surrogate script on your own site accessed via GET returning appropriate caching instructions - and this could return the data as a serialized php array / object which is much less expensive to process. Indeed whichever method you choose I would recommend caching the parsed response - not the XML. This also allows you to override poor caching information from the service.
If the rate is less than around 1 per minute then the cron solution is overkill. But if it's more than 20 per minute then it makes a lot of sense. If you don't have access to cron / can't install your own software then you might consider simply caching the response and refreshing the cache on demand. Don't bother with memcache unless you are already using it. APC is faster on a single server - but memcache is distributed. If you have multiple servers then use whatever cluster storage you are currently sharing your data in (distributed filesystem / database cluster / shared filesystem....).
Don't try to use locking / mutexes around the cache refresh unless you really have to (i.e. only if accessing the service more than once every 5 minutes is a mortal sin) - this gets real complicated real quick - it's too easy to introduce bugs.
Do make sure you buffer and validate any responses before writing them to the cache.
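
One way to read the last point, buffering and validating before writing, is sketched below in Python. The validation rule (a dictionary with an "items" list) is a made-up placeholder for whatever your service actually returns, and write_cache_atomically is a hypothetical helper. The response is written to a temporary file first and then renamed over the cache file, so readers never see a half-written or invalid cache.

```python
import json
import os
import tempfile


def is_valid(payload):
    # Hypothetical rule: a usable response is a dict carrying an "items" list.
    return isinstance(payload, dict) and isinstance(payload.get("items"), list)


def write_cache_atomically(path, payload):
    """Write payload to path only if it validates; never leave a partial file."""
    if not is_valid(payload):
        return False  # keep serving the previous cache contents
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(payload, handle)  # buffer the full response first
        os.replace(tmp_path, path)  # then swap it in with a single rename
    except BaseException:
        os.unlink(tmp_path)
        raise
    return True


cache_dir = tempfile.mkdtemp()
cache_path = os.path.join(cache_dir, "feed.json")
stored_good = write_cache_atomically(cache_path, {"items": [1, 2, 3]})
stored_bad = write_cache_atomically(cache_path, {"error": "upstream timeout"})
with open(cache_path) as handle:
    on_disk = json.load(handle)
```

The failed refresh leaves the last good copy in place, which is usually preferable to caching an error for five minutes.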

Yes, just use HTTP. Most of the heavy lifting has already been built into your web server.
Since SOAP is just a simple HTTP POST request with an XML body, you could set up your website or HTTP API in front of the SOAP endpoint to act as a translator to regular HTTP, attach the appropriate HTTP caching headers to the transformed response body, and then configure an nginx reverse proxy in front of it.
Notably: if the transformation is simple you could just use XSLT to transform the response body from the SOAP API and remove the web service layer entirely.

Your problem is a very small one, which does not require a complicated solution.
You could write a small cron job that is executed every five minutes, sends the request to the SOAP server, and stores the result in a local file. If any script needs the data, it reads the local file. This will result in 288 requests to the SOAP server per day, and have excellent performance for any script call that needs the results because they are already on your server.
If you do not have cron jobs available and cannot fake them, any other cache will do. You really don't need fancy stuff like Memcached, unless it already is available. Storing the result to a cache file will work as well. Note that if you have to really fetch the SOAP result from the origin, this will take some more time and might affect the perceived performance of your site.
There are plenty of frameworks which also offer cache support, and if you use one you should investigate if there is support included. I'm not sure if Joomla has something appropriate for you. Otherwise, you can implement something yourself. It isn't that hard.
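
A rough sketch of the cron-plus-local-file approach follows, in Python for brevity. The refresh function is what the cron job would run every five minutes, and read_cached is what the page scripts would call; both names, and the idea of flagging data older than two refresh intervals, are assumptions added for illustration.

```python
import json
import os
import tempfile
import time

REFRESH_SECONDS = 5 * 60
# One request every five minutes, as computed in the answer above.
requests_per_day = 24 * 60 * 60 // REFRESH_SECONDS


def refresh(path, fetch):
    """Run by the cron job: fetch once and store the result locally."""
    with open(path, "w") as handle:
        json.dump(fetch(), handle)


def read_cached(path, max_age=2 * REFRESH_SECONDS, now=None):
    """Run by page scripts: return (data, is_fresh) without touching the origin."""
    now = time.time() if now is None else now
    age = now - os.path.getmtime(path)
    with open(path) as handle:
        return json.load(handle), age <= max_age


path = os.path.join(tempfile.mkdtemp(), "soap_result.json")
refresh(path, lambda: {"temperature": 21})  # the lambda stands in for the SOAP call
data, fresh = read_cached(path)
# Pretend an hour has passed without the cron job running.
later_data, later_fresh = read_cached(path, now=time.time() + 3600)
```

The freshness flag lets a page notice that the cron job has stopped running, instead of silently serving hour-old data.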

Cache functionality comes in various flavours:
memory-based, where a separate process on the server holds data in RAM (or overflows to disk) and you query it like you would a database; very efficient and powerful, and will have options to manage storage use and clear up after themselves, but requires setting up additional software on the server; e.g. memcached, redis
file-based, where you just write the data to disk; less efficient, but can be implemented in "user-land" code, i.e. pure PHP; beware of filling up your disk with variant caches that have expired but not been cleaned up; many frameworks have an implementation of this built in
database-backed, where you push data into an RDBMS (e.g. MySQL, PostgreSQL) or fully-featured NoSQL store (e.g. MongoDB); might make sense if you have a large amount of data, and can trade a bit of performance; as with files, you need to make sure that stale data is cleaned up
In each case, the basic idea is that you create a "key" that can tell one request from another (e.g. the name of the SOAP call and its input parameters, serialized), and pick a "lifetime" (how long you want to carry on using the same copy of the data). The caching engine or library then checks for a cache with that key, and if it is still within its "lifetime" returns the previously cached data. If there is a "cache miss" (there is no cache for that key, or it has expired), you perform the costly operation (in your case, the SOAP call) and save to the cache, using the same key.
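
Building the "key" from the call name and its serialized parameters could look like the following Python sketch. The "soap:" prefix, the GetPrices call and the 16-character digest length are arbitrary choices for illustration. The parameters are serialized with sorted keys so that the same inputs always produce the same key, regardless of the order in which they were supplied.

```python
import hashlib
import json


def cache_key(call_name, params):
    """Derive a stable cache key from a call name and its input parameters."""
    # Sorted keys and fixed separators give one canonical form per input set.
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"soap:{call_name}:{digest[:16]}"


key_a = cache_key("GetPrices", {"region": "eu", "currency": "EUR"})
key_b = cache_key("GetPrices", {"currency": "EUR", "region": "eu"})
key_c = cache_key("GetPrices", {"region": "us", "currency": "USD"})
```

Here key_a and key_b are identical while key_c differs, so two requests share a cache entry exactly when they would produce the same SOAP call.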
You can do more complex things, like pre-caching things in the background so that there is never a cache miss, or having some code paths which accept stale data in order to return quickly, but these can generally be implemented on top of whatever you're using as the main caching solution.
Edit Another important decision is at what level of granularity to cache the data, in relation to processing it. At one extreme, you could cache each individual SOAP call: simple to set up, but means re-processing the same data repeatedly, and can cause problems if two responses are related, but cached independently and may get out of sync. At the other extreme, you can cache whole rendered pages: pages load very fast once cached, but creating variations based on the same data without repeating work becomes tricky. In between are various points in your code where you have processed and combined data into meaningful chunks: if your application is well-written, these are the input and output of major functions, or possibly even complete model objects; this is more work to implement, as you have to choose the right keys (avoiding two contexts overwriting each other's caches while ignoring variables that have no impact on the data in question) and values (avoiding repeats of costly work without having to store huge blobs of data which will be slow to unserialize and use up the capacity of your cache store). As with anything else, no approach suits all needs, and a complex application will probably involve caching at multiple levels for different purposes.

when a webpage is generated in real time which memory it uses server-side or client side?

I have written a php code which will get one id from database and using that id it will use some API's provided by other websites and generate a page.
here my question is where this generated page will occupy the space on the server or on the client machine?
if 10000 people will open the same page then will my server be slow down in this case.
should i store all data for that API in our MySQL-database.
what will make it fast & safe...
Please suggest me...
Thanks

I have written a php code which will get one id from database and
using that id it will use some API's provided by other websites and
generate a page.
here my question is where this generated page will occupy the space on
the server or on the client machine?
The page will be generated on the client if you only fetch one id from your database. For this you could first do a jquery.get to fetch the id from your server. Next you could get data from the other APIs using JSONP (JSON with padding). For this to work the APIs of course need to support JSONP: JavaScript clients can't fetch the data with jquery.get because of the same-origin policy, but luckily JSONP can be used instead. Finally you could simply append the data to the DOM using .html. Be careful doing this with other APIs; you need to be sure they are safe, otherwise you would be vulnerable to XSS. If you are not certain, use .text instead.
should i store all data for that API in our MySQL-database.
It depends on whether the APIs provide JSONP.
what will make it fast & safe...
Fast
APC to cache compiled bytecode. This will speed up your website tremendously without even changing a single line in your code-base.
An in-memory database such as redis or memcached. You can also use APC to store data in memory. This will speed up your website tremendously, because touching the disk (spinning the disk to the right sector, etc.) is very expensive and using memory is very fast.
The No-Framework approach will make your site fast: because PHP is a dynamic language, you should try to do as little as possible.
Tackle low hanging fruit only. Remember that "Premature optimization is the root of all evil". Rasmus Lerdorf teaches you how to do this in this video Simple is Hard from DrupalCon 2008. The slides are available at PHP's talks section
Safe
Read up on the OWASP Top 10
Protect against XSS using filter
Protect against SQL-injection using PDO(prepared statements).
Protect against CSRF

It all depends on your garbage collection. The memory will be used by your server while the page is being rendered, but once the output is sent to the browser, PHP will no longer care. Now, if you have really bad garbage collection, Apache can certainly run out of memory. It has built-in garbage collection protocols but if you rely on those, you're just asking for dropped packets and page hangs.
If 10000 people access your server at the same time, it'll likely be your CPU that will be the bottleneck.
This is why tried-and-true PHP frameworks are ideal for large projects because most of them have taken all this into account and have built-in optimization implementations.

It depends really.
Factors are:-
Time taken to generate request's response
Size of the request
Concurrent connections
Web Server
Speed of the api
and many more...
Your server is not likely to slow down if 10000 requests are made over a period of time, but if 10000 requests are made every second there is likely to be an impact, and how large depends on the factors listed above. Each concurrent connection to the server uses some memory, and running out of memory may halt the server. So make sure that even if you get that many requests, they are served fast and their connections and processes are not kept in memory for long. This saves memory and keeps your server from crashing.
However if the output for the api is going to be the same for various users then it would be wiser to keep the object in the memory as memory access is much faster than a disk access.

If 10000 people will be grabbing the same page that you're dynamically creating by manipulating another site's API, it sounds like you're pulling data from the other site, and constructing a page using PHP on your server. So yes, that consumes a small amount of memory and processing resources on your system, per hit. Memory use may be limited by the number of threads or forks your webserver is allowed to use. Processing power will not be limited artificially; it will be constrained by what your server can handle.
But back to that number of 10000 people grabbing the same page, again. If that's a possibility you would want to generate the page locally, and cache it somehow so that it only has to be generated once. It doesn't make sense to generate the same output 10,000 times when you could generate it once and let it be fetched 10,000 times instead. Then it just becomes a matter of deciding when the cache is stale.

Does a separate MySQL server make sense when using Nginx instead of Apache?

Consider a web app in which a call to the app consists of PHP script running several MySQL queries, some of them memcached.
The PHP does not do very complex job. It is mainly serving the MySQL data with some formatting.
In the past it used to be recommended to put MySQL and the app engine (PHP/Apache) on separate boxes.
However, when the data can be divided horizontally (for example when there are ten different customers using the service and it is possible to divide the data per customer) and when Nginx +FastCGI is used instead of heavier Apache, doesn't it make sense to put Nginx Memcache and MySQL on the same box? Then when more customers come, add similar boxes?
Background: We are moving to Amazon Ec2. And a separate box for MySQL and app server means double EBS volumes (needed on app servers to keep the code persistent as it changes often). Also if something happens to the database box, more customers will fail.
Clarification: Currently the app is running with LAMP on a single server (before moving to EC2).

If your application architecture is already designed to support Nginx and MySQL on separate instances, you may want to host all your services on the same instance until you receive enough traffic that justifies the separation.
In general, creating new identical instances with the full stack (Nginx + Your Application + MySQL) will make your setup much more difficult to maintain. Think about taking backups, releasing application updates, patching the database engine, updating the database schema, generating reports on all your clients, etc. If you opt for this method, you would really need to find some big advantages in order to offset all the disadvantages.

You need to measure carefully how much memory overhead everything has - I can't see nginx vs Apache making much difference; it's PHP which will use all the RAM (this in turn depends on how many processes the web server chooses to run, but that's more of a tuning issue).
Personally I'd stay away from nginx on the grounds that it is too risky to run such a weird server in production.
Databases always need lots of ram, and the only way you can sensibly tune the memory buffers is to have them on dedicated servers. This is assuming you have big data.
If you have very small data, you could keep it on the same box.
Likewise, memcached makes almost no sense if you're not running it on dedicated boxes. Taking memory from MySQL to give to memcached is really robbing Peter to pay Paul. MySQL can cache stuff in its innodb_buffer_pool quite efficiently (This saves IO, but may end up using more CPU as you won't cache presentation logic etc, which may be possible with memcached).
Memcached is only sensible if you're running it on dedicated boxes with lots of ram; it is also only sensible if you don't have enough grunt in your db servers to serve the read-workload of your app. Think about this before deploying it.

If your application is able to work with PHP and MySQL on different servers (I don't see why this wouldn't work, actually), then, it'll also work with PHP and MySQL on the same server.
The real question is : will your servers be able to handle the load of both Apache/nginx/PHP, MySQL, and memcached ?
And there is only one way to answer that question : you have to test in a "real" "production" configuration, to determine how loaded your servers are -- or use some tool like ab, siege, or OpenSTA to "simulate" that load.
If there is not too much load with everything on the same server... Well, go with it, if it makes the hosting of your application cheaper ;-)

Memcache to deal with high latency web services APIs - good idea?

I have a PHP application that calls web services APIs to get some objects before rendering a web page that incorporates those objects. In some cases these APIs are really slow (seconds) and that is not acceptable from a user experience point of view. Two things I know I can do...
Use ajax and make these calls in the background
Time out the call and degrade gracefully if it is taking too long
Neither is ideal, so I was thinking about using memcache (the PHP extension for memcached) to cache the object that I get from the 3rd party web service. The objects will be loaded many times by different users loading the same page, so this seems to make sense.
The objects are relatively small (~1k).
Does this sound like a reasonable approach? I know memcached was originally designed to alleviate database load, so I'm wondering whether there is a gotcha somewhere that I'm not seeing.
Thanks.

This is a perfectly legitimate use of memcache. It is not only for database load reduction, it is for caching and object storage in general. :)
Also note, PHP has two interfaces for memcached. Confusingly, they are named "memcache" and "memcached". Read these to pick between the two:
https://serverfault.com/questions/63383/memcache-vs-memcached
http://code.google.com/p/memcached/wiki/PHPClientComparison

I'd highly recommend memcache for this situation as it will:
Reduce DNS calls.
Reduce page latency.
Reduce bandwidth usage.
Your only real task is to determine how often the data you are dealing with will be changing. This will help you to optimize your expiry time for the cache key(s).

This approach may not work for you in your situation, but you might use cron jobs to call a PHP script that loads the required information then caches it to a more speedy data source (XML or Database).
This may not work if the information is updated really often or if there is a lot of different data that needs to be loaded, but it is an option. I've used this approach for other tasks that take a lot of time to complete and have found it to be a reasonable solution.

Best practices for withstanding launch day traffic burst

We are working on a website for a client that (for once) is expected to get a fair amount of traffic on day one. There are press releases, people are blogging about it, etc. I am a little concerned that we're going to fall flat on our face on day one. What are the main things you would look at to ensure (in advance without real traffic data) that you can stay standing after a big launch?
Details: This is a L/A/M/PHP stack, using an internally developed MVC framework. This is currently being launched on one server, with Apache and MySQL both on it, but we can break that up if need be.
We are already installing Memcached and doing as much PHP-level caching as we can think of. Some of the pages are rather query intensive, and we are using Smarty as our template engine. Keep in mind there is no time to change any of these major aspects--this is the just the setup. What sorts of things should we watch out for?

Measure first, and then optimize. Have you done any load testing? Where are the bottlenecks?
Once you know your bottlenecks then you can intelligently decide if you need additional database boxes or web boxes. Right now you'd just be guessing.
Also, how do your load testing results compare against your expected traffic? Can you handle two times the expected traffic? Five times? How easily and quickly can you acquire and release extra hardware? I'm sure the business requirement is to not fail during launch, so make sure you have lots of capacity available. You can always release it afterwards when the load has stabilized and you know what you need.

I would at least factor out all static content. Set up another vhost somewhere else and load all the graphics, CSS, and JavaScript onto it. You can buy some extra cycles by offloading the serving of that type of content. If you're really concerned, you can sign up for a content distribution service. There are lots now, similar to Akamai and quite cheap.
Another idea might be to utilize Apache mod_proxy to keep the generated page output for a specific amount of time. APC would also be quite usable... You could employ output buffering capture + the last modified time of related data on the page, and use the APC cached version. If the page isn't valid any more, you regenerate and store in APC again.
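
The output-capture idea in the previous paragraph can be sketched as follows. This is Python with a plain dictionary standing in for APC, and PageCache, render and build_home are hypothetical names. The cached page is reused as long as it was built from data at least as new as the current last-modified time, and regenerated otherwise.

```python
class PageCache:
    """Cache rendered pages, keyed by URL and validated by data modification time."""

    def __init__(self):
        self._pages = {}  # url -> (data_modified_at, html)

    def render(self, url, data_modified_at, build):
        cached = self._pages.get(url)
        if cached is not None and cached[0] >= data_modified_at:
            return cached[1]  # underlying data unchanged: reuse the stored output
        html = build()  # page is stale or missing: regenerate and store again
        self._pages[url] = (data_modified_at, html)
        return html


builds = []


def build_home():  # stands in for the expensive query + template work
    builds.append(1)
    return f"<h1>Home v{len(builds)}</h1>"


pages = PageCache()
a = pages.render("/", 100, build_home)
b = pages.render("/", 100, build_home)  # same data timestamp: served from cache
c = pages.render("/", 250, build_home)  # data changed: rebuilt
```

The cost that remains on every request is looking up the last-modified time, so that lookup needs to be much cheaper than rendering the page for this to pay off.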
Good luck. It'll be a learning experience!

Have a beta period where you allow in as many users as you can handle, measure your site's performance, and work out bugs before you go live.
You can either control the number of users explicitly in a private beta, or a Google-style semi-public beta where each user has a number of referrals that they can offer to their friends.

To prepare or handle a spike (or peak) performance, I would first determine whether you are ready through some simple performance testing with something like jmeter.
It is easy to set up and get started and will give you early metrics whether you will handle an expected peak load.
However, given your time constraints, other steps to take would be to prepare static versions of the content that will attract the most attention (such as press releases, if it's your launch day). Also ensure that you are making the best use of client-side caching (one fewer request to your server can make all the difference). The web is already designed for extremely high scalability, and effective use of content caching is your best friend in these situations.
When things calm down, there is an excellent podcast on high scalability on Software Engineering Radio about the design of the new Guardian website.
Good luck on the launch.

I'd, personally, do a few things
1) Put in some sort of load balancer/database replication system
This means that you can have your service spread across multiple servers. Can't afford to have more than one server permanently? Use Amazon EC2 - it's good for things like this (switch on a few more servers to handle the load)
2) Code in some "High Load" restrictions
For example, if your searching is inefficient - switch it off when load gets to a certain level. "Sorry, we're busy, try again later for searching"
3) Load test... Use something like ApacheBench to stress test your servers.
4) Personally, I think that switching "Keep-Alive" Connections off is better. It may slightly reduce overall performance, but - it means that instead of having something where the site works well for a few people, and the others get timeouts, everyone gets inconsistent service, if it gets to that level
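
Point 2 above, the "High Load" restriction, might look roughly like this. The sketch is in Python; the threshold of 4.0 is an arbitrary placeholder, and search and current_load are hypothetical functions. os.getloadavg() is only available on Unix-like systems, so the sketch falls back to 0.0 elsewhere.

```python
import os

SEARCH_LOAD_LIMIT = 4.0  # hypothetical threshold; tune it from your load tests


def current_load():
    """One-minute load average, or 0.0 where the platform does not provide it."""
    try:
        return os.getloadavg()[0]
    except (AttributeError, OSError):
        return 0.0


def search(query, load=None):
    """Serve search normally, but switch it off while the server is overloaded."""
    load = current_load() if load is None else load
    if load > SEARCH_LOAD_LIMIT:
        return {"status": 503,
                "body": "Sorry, we're busy, try again later for searching"}
    return {"status": 200, "body": f"results for {query!r}"}


busy = search("php", load=9.5)  # load passed explicitly to show both branches
calm = search("php", load=0.3)
```

Shedding one expensive feature this way keeps the rest of the site responsive, which is usually a better launch-day outcome than every page timing out.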
Linux Format did a good article on "How to survive a slashdotting"... which I've found useful in the past. It's available online as a PDF

Basic first steps to harden your site for high traffic.
Use a low-cost tool like https://browsermob.com/ to load-test your site. At a minimum, you should be looking at 100K unique visitors per hour. If you get an ad off of the MSN home page, look to be able to handle 500K unique visitors per hour.
Move all static graphic/video content to a CDN. Edgecast and Amazon are two excellent choices.
Use Jet Profiler to profile your MySQL server to analyze any slow performing queries. Minor changes can have huge benefits.

Look into using Varnish - it's a caching reverse proxy server (like Squid, but much more single purpose).
I've run some pretty big sites behind it, and it seemed to work really well.
