I have a small question about crawling a web page in PHP. I have to crawl about 90 000 products on one big eshop. I tried it in PHP, but one product takes about 2-3 sec and that's bad. Any tips, how to do it faster? Maybe a C++ multithread version? But what about time of a HTTP request? I mean, is it PHP's limitation or not? Thank you for the tips.
That's an extremely vague question. When you benchmarked the code you have, what was the slowest part? Was it network transfer times? Using a different language (or multiple threads) won't change that.
Was it time spent parsing the page? How are you doing that? If you're using an XML library to parse the entire DOM, could you get away with just looking for keywords (or even regular expressions)? That's less precise (and in some sense less correct) but perhaps it's faster.
What algorithms are you using for your analysis? Would other data structures provide better performance? As one simple example, if you spend a lot of time iterating over an array, perhaps a hash map is more appropriate.
PHP can be run in multiple processes. What happens if you kick off multiple instances of your script at once (on different pages)? Does the total time decrease?
Ultimately you've described a very general problem so I can't offer very specific solutions, but there is no inherent reason why PHP is inappropriate for this task. When you've identified what's slow (regardless of what language you're using) you should be able to more precisely address how to fix it.
I don't think it's PHPs problem but it could be depending on connection speed/computer speed. I've never had a speed problem with PHP/cURL though.
Just do multiple threads (ie. multiple connections at once), I suggest you use cURL but that's only because I'm familiar with it.
Here's a guide I've used for multiple threads for scraping with cURL:
http://semlabs.co.uk/journal/object-oriented-curl-class-with-multi-threading
Be VERY careful not to accidentally cause a denial of service situation with your scripts. But I'm sure you're already away of that possibility.
If your program is running slowly, my advice would be to run a profiler on it, and analyse why it's running slowly.
This advice applies to any language, but in the case of PHP, the profiler software you need is called xDebug.
This is a PHP extension, so you need to install it into your server. If you're running on an ISP's server, then you may not have permission to do this, but you can always install it with PHP on your local PC and run your tests there.
Once you've got xDebug installed, switch on the profiling features in PHP.ini (see the xDebug documentation for instruction on this), and run your program. It will then generate profiler files, which can be used to analyse what the program is doing.
Download KCacheGrind to perform the analysis. This will generate call tree information, showing exactly what happened as the program ran, and how long every function call took.
With this information, you can look for the function calls that are running slowly, and work out what's happening. Usually the reason for slow code is some kind of inefficiency in the way something is written; xDebug will help you find it.
Hope that helps.
You have 99% probability that PHP is NOT the problem. It is rather the eshop webserver or any other network latency.
I know this for sure because I have been doing this for months now, and even if your code has lots of regular expressions, data scraping is really fast in PHP.
The solution to speed this ? Pre cache all the website with a command line crawler since disk space is cheap. curl can do this, and httrack as well. It will be much faster and stable than PHP doing the crawling.
Then let PHP do the parsing alone, you will see hopefully PHP chomping dozens of pages per minute, hope this helps :)
Right now I'm running 50 PHP (in CLI mode) individual workers (processes) per machine that are waiting to receive their workload (job). For example, the job of resizing an image. In workload they receive the image (binary data) and the desired size. The worker does it's work and returns the resized image back. Then it waits for more jobs (it loops in a smart way). I'm presuming that I have the same executable, libraries and classes loaded and instantiated 50 times. Am I correct? Because this does not sound very effective.
What I'd like to have now is one process that handles all this work and being able to use all available CPU cores while having everything loaded only once (to be more efficient). I presume a new thread would be started for each job and after it finishes, the thread would stop. More jobs would be accepted if there are less than 50 threads doing the work. If all 50 threads are busy, no additional jobs are accepted.
I am using a lot of libraries (for Memcached, Redis, MogileFS, ...) to have access to all the various components that the system uses and Python is pretty much the only language apart from PHP that has support for all of them.
Can Python do what I want and will it be faster and more efficient that the current PHP solution?
Most probably - yes. But don't assume you have to do multithreading. Have a look at the multiprocessing module. It already has an implementation of a Pool included, which is what you could use. And it basically solves the GIL problem (multithreading can run only 1 "standard python code" at any time - that's a very simplified explanation).
It will still fork a process per job, but in a different way than starting it all over again. All the initialisations done- and libraries loaded before entering the worker process will be inherited in a copy-on-write way. You won't do more initialisations than necessary and you will not waste memory for the same libarary/class if you didn't actually make it different from the pre-pool state.
So yes - looking only at this part, python will be wasting less resources and will use a "nicer" worker-pool model. Whether it will really be faster / less CPU-abusing, is hard to tell without testing, or at least looking at the code. Try it yourself.
Added: If you're worried about memory usage, python may also help you a bit, since it has a "proper" garbage collector, while in php GC is a not a priority and not that good (and for a good reason too).
Linux has shared libraries, so those 50 php processes use mostly the same libraries.
You don't sound like you even have a problem at all.
"this does not sound very effective." is not a problem description, if anything those words are a problem on their own. Writing code needs a real reason, else you're just wasting time and/or money.
Python is a fine language and won't perform worse than php. Python's multiprocessing module will probably help a lot too. But there isn't much to gain if the php implementation is not completly insane. So why even bother spending time on it when everything works? That is usually the goal, not a reason to rewrite ...
If you are on a sane operating system then shared libraries should only be loaded once and shared among all processes using them. Memory for data structures and connection handles will obviously be duplicated, but the overhead of stopping and starting the systems may be greater than keeping things up while idle. If you are using something like gearman it might make sense to let several workers stay up even if idle and then have a persistent monitoring process that will start new workers if all the current workers are busy up until a threshold such as the number of available CPUs. That process could then kill workers in a LIFO manner after they have been idle for some period of time.
I'm studying high-performance coding for websites in PHP, and this idea popped into my mind:
We know that accessing a database uses a significant amount of CPU usage, so we cache such data, saving it to the HDD. But I was wondering, can't it rest in the RAM of the server, so I can access it even more faster?
You might want to check out memcached:
http://www.php.net/manual/en/intro.memcache.php
PHP normally comes with APC as a bytecode cache. You can also use it as a local cache. If you need something in a distributed/clustered environment, then memcached (plus possibly beanstalkd) is the way to go.
XCache, eaccelerator, apc and memcache allow you to save items to semi persistent memory (you don't necessarily know when an item will expire in most cases). It isn't the same as a database, more like a key/value list. The downside being that it requires a third party library, so you might be a bit limited depending on your environment.
I think you might be able to get the same effect using shared memory (via php's shmop_ functions). But I have never used them or know if they are included with php's library so someone feel free to bash me or edit out this mention.
If your server is ANY good, then it will already do so. But of course, it may be the case that your server is serving a few thousand other tasks besides yours as well, meaning you don't have that server's cache all for yourself.
And if there really are a few thousand others being served besides you, then the probability just gets higher that there is at least one nutcase among those thousands of others, who is doing something that he really shouldn't be doing but that the server has not been programmed to detect, not been programmed to stop, but just been programmed to try and make the best of it, at the expense of availability of resources for the x999 "responsible" users.
A site I built with Kohana was slammed with an enormous amount of traffic yesterday, causing me to take a step back and evaluate some of the design. I'm curious what are some standard techniques for optimizing Kohana-based applications?
I'm interested in benchmarking as well. Do I need to setup Benchmark::start() and Benchmark::stop() for each controller-method in order to see execution times for all pages, or am I able to apply benchmarking globally and quickly?
I will be using the Cache-library more in time to come, but I am open to more suggestions as I'm sure there's a lot I can do that I'm simply not aware of at the moment.
What I will say in this answer is not specific to Kohana, and can probably apply to lots of PHP projects.
Here are some points that come to my mind when talking about performance, scalability, PHP, ...
I've used many of those ideas while working on several projects -- and they helped; so they could probably help here too.
First of all, when it comes to performances, there are many aspects/questions that are to consider:
configuration of the server (both Apache, PHP, MySQL, other possible daemons, and system); you might get more help about that on ServerFault, I suppose,
PHP code,
Database queries,
Using or not your webserver?
Can you use any kind of caching mechanism? Or do you need always more that up to date data on the website?
Using a reverse proxy
The first thing that could be really useful is using a reverse proxy, like varnish, in front of your webserver: let it cache as many things as possible, so only requests that really need PHP/MySQL calculations (and, of course, some other requests, when they are not in the cache of the proxy) make it to Apache/PHP/MySQL.
First of all, your CSS/Javascript/Images -- well, everything that is static -- probably don't need to be always served by Apache
So, you can have the reverse proxy cache all those.
Serving those static files is no big deal for Apache, but the less it has to work for those, the more it will be able to do with PHP.
Remember: Apache can only server a finite, limited, number of requests at a time.
Then, have the reverse proxy serve as many PHP-pages as possible from cache: there are probably some pages that don't change that often, and could be served from cache. Instead of using some PHP-based cache, why not let another, lighter, server serve those (and fetch them from the PHP server from time to time, so they are always almost up to date)?
For instance, if you have some RSS feeds (We generally tend to forget those, when trying to optimize for performances) that are requested very often, having them in cache for a couple of minutes could save hundreds/thousands of request to Apache+PHP+MySQL!
Same for the most visited pages of your site, if they don't change for at least a couple of minutes (example: homepage?), then, no need to waste CPU re-generating them each time a user requests them.
Maybe there is a difference between pages served for anonymous users (the same page for all anonymous users) and pages served for identified users ("Hello Mr X, you have new messages", for instance)?
If so, you can probably configure the reverse proxy to cache the page that is served for anonymous users (based on a cookie, like the session cookie, typically)
It'll mean that Apache+PHP has less to deal with: only identified users -- which might be only a small part of your users.
About using a reverse-proxy as cache, for a PHP application, you can, for instance, take a look at Benchmark Results Show 400%-700% Increase In Server Capabilities with APC and Squid Cache.
(Yep, they are using Squid, and I was talking about varnish -- that's just another possibility ^^ Varnish being more recent, but more dedicated to caching)
If you do that well enough, and manage to stop re-generating too many pages again and again, maybe you won't even have to optimize any of your code ;-)
At least, maybe not in any kind of rush... And it's always better to perform optimizations when you are not under too much presure...
As a sidenote: you are saying in the OP:
A site I built with Kohana was slammed with
an enormous amount of traffic yesterday,
This is the kind of sudden situation where a reverse-proxy can literally save the day, if your website can deal with not being up to date by the second:
install it, configure it, let it always -- every normal day -- run:
Configure it to not keep PHP pages in cache; or only for a short duration; this way, you always have up to date data displayed
And, the day you take a slashdot or digg effect:
Configure the reverse proxy to keep PHP pages in cache; or for a longer period of time; maybe your pages will not be up to date by the second, but it will allow your website to survive the digg-effect!
About that, How can I detect and survive being “Slashdotted”? might be an interesting read.
On the PHP side of things:
First of all: are you using a recent version of PHP? There are regularly improvements in speed, with new versions ;-)
For instance, take a look at Benchmark of PHP Branches 3.0 through 5.3-CVS.
Note that performances is quite a good reason to use PHP 5.3 (I've made some benchmarks (in French), and results are great)...
Another pretty good reason being, of course, that PHP 5.2 has reached its end of life, and is not maintained anymore!
Are you using any opcode cache?
I'm thinking about APC - Alternative PHP Cache, for instance (pecl, manual), which is the solution I've seen used the most -- and that is used on all servers on which I've worked.
See also: Slides APC Facebook,
Or Benchmark Results Show 400%-700% Increase In Server Capabilities with APC and Squid Cache.
It can really lower the CPU-load of a server a lot, in some cases (I've seen CPU-load on some servers go from 80% to 40%, just by installing APC and activating it's opcode-cache functionality!)
Basically, execution of a PHP script goes in two steps:
Compilation of the PHP source-code to opcodes (kind of an equivalent of JAVA's bytecode)
Execution of those opcodes
APC keeps those in memory, so there is less work to be done each time a PHP script/file is executed: only fetch the opcodes from RAM, and execute them.
You might need to take a look at APC's configuration options, by the way
there are quite a few of those, and some can have a great impact on both speed / CPU-load / ease of use for you
For instance, disabling [apc.stat](https://php.net/manual/en/apc.configuration.php#ini.apc.stat) can be good for system-load; but it means modifications made to PHP files won't be take into account unless you flush the whole opcode-cache; about that, for more details, see for instance To stat() Or Not To stat()?
Using cache for data
As much as possible, it is better to avoid doing the same thing over and over again.
The main thing I'm thinking about is, of course, SQL Queries: many of your pages probably do the same queries, and the results of some of those is probably almost always the same... Which means lots of "useless" queries made to the database, which has to spend time serving the same data over and over again.
Of course, this is true for other stuff, like Web Services calls, fetching information from other websites, heavy calculations, ...
It might be very interesting for you to identify:
Which queries are run lots of times, always returning the same data
Which other (heavy) calculations are done lots of time, always returning the same result
And store these data/results in some kind of cache, so they are easier to get -- faster -- and you don't have to go to your SQL server for "nothing".
Great caching mechanisms are, for instance:
APC: in addition to the opcode-cache I talked about earlier, it allows you to store data in memory,
And/or memcached (see also), which is very useful if you literally have lots of data and/or are using multiple servers, as it is distributed.
of course, you can think about files; and probably many other ideas.
I'm pretty sure your framework comes with some cache-related stuff; you probably already know that, as you said "I will be using the Cache-library more in time to come" in the OP ;-)
Profiling
Now, a nice thing to do would be to use the Xdebug extension to profile your application: it often allows to find a couple of weak-spots quite easily -- at least, if there is any function that takes lots of time.
Configured properly, it will generate profiling files that can be analysed with some graphic tools, such as:
KCachegrind: my favorite, but works only on Linux/KDE
Wincachegrind for windows; it does a bit less stuff than KCacheGrind, unfortunately -- it doesn't display callgraphs, typically.
Webgrind which runs on a PHP webserver, so works anywhere -- but probably has less features.
For instance, here are a couple screenshots of KCacheGrind:
(source: pascal-martin.fr)
(source: pascal-martin.fr)
(BTW, the callgraph presented on the second screenshot is typically something neither WinCacheGrind nor Webgrind can do, if I remember correctly ^^ )
(Thanks #Mikushi for the comment) Another possibility that I haven't used much is the the xhprof extension : it also helps with profiling, can generate callgraphs -- but is lighter than Xdebug, which mean you should be able to install it on a production server.
You should be able to use it alonside XHGui, which will help for the visualisation of data.
On the SQL side of things:
Now that we've spoken a bit about PHP, note that it is more than possible that your bottleneck isn't the PHP-side of things, but the database one...
At least two or three things, here:
You should determine:
What are the most frequent queries your application is doing
Whether those are optimized (using the right indexes, mainly?), using the EXPLAIN instruction, if you are using MySQL
See also: Optimizing SELECT and Other Statements
You can, for instance, activate log_slow_queries to get a list of the requests that take "too much" time, and start your optimization by those.
whether you could cache some of these queries (see what I said earlier)
Is your MySQL well configured? I don't know much about that, but there are some configuration options that might have some impact.
Optimizing the MySQL Server might give you some interesting informations about that.
Still, the two most important things are:
Don't go to the DB if you don't need to: cache as much as you can!
When you have to go to the DB, use efficient queries: use indexes; and profile!
And what now?
If you are still reading, what else could be optimized?
Well, there is still room for improvements... A couple of architecture-oriented ideas might be:
Switch to an n-tier architecture:
Put MySQL on another server (2-tier: one for PHP; the other for MySQL)
Use several PHP servers (and load-balance the users between those)
Use another machines for static files, with a lighter webserver, like:
lighttpd
or nginx -- this one is becoming more and more popular, btw.
Use several servers for MySQL, several servers for PHP, and several reverse-proxies in front of those
Of course: install memcached daemons on whatever server has any amount of free RAM, and use them to cache as much as you can / makes sense.
Use something "more efficient" that Apache?
I hear more and more often about nginx, which is supposed to be great when it comes to PHP and high-volume websites; I've never used it myself, but you might find some interesting articles about it on the net;
for instance, PHP performance III -- Running nginx.
See also: PHP-FPM - FastCGI Process Manager, which is bundled with PHP >= 5.3.3, and does wonders with nginx.
Well, maybe some of those ideas are a bit overkill in your situation ^^
But, still... Why not study them a bit, just in case ? ;-)
And what about Kohana?
Your initial question was about optimizing an application that uses Kohana... Well, I've posted some ideas that are true for any PHP application... Which means they are true for Kohana too ;-)
(Even if not specific to it ^^)
I said: use cache; Kohana seems to support some caching stuff (You talked about it yourself, so nothing new here...)
If there is anything that can be done quickly, try it ;-)
I also said you shouldn't do anything that's not necessary; is there anything enabled by default in Kohana that you don't need?
Browsing the net, it seems there is at least something about XSS filtering; do you need that?
Still, here's a couple of links that might be useful:
Kohana General Discussion: Caching?
Community Support: Web Site Optimization: Maximum Website Performance using Kohana
Conclusion?
And, to conclude, a simple thought:
How much will it cost your company to pay you 5 days? -- considering it is a reasonable amount of time to do some great optimizations
How much will it cost your company to buy (pay for?) a second server, and its maintenance?
What if you have to scale larger?
How much will it cost to spend 10 days? more? optimizing every possible bit of your application?
And how much for a couple more servers?
I'm not saying you shouldn't optimize: you definitely should!
But go for "quick" optimizations that will get you big rewards first: using some opcode cache might help you get between 10 and 50 percent off your server's CPU-load... And it takes only a couple of minutes to set up ;-) On the other side, spending 3 days for 2 percent...
Oh, and, btw: before doing anything: put some monitoring stuff in place, so you know what improvements have been made, and how!
Without monitoring, you will have no idea of the effect of what you did... Not even if it's a real optimization or not!
For instance, you could use something like RRDtool + cacti.
And showing your boss some nice graphics with a 40% CPU-load drop is always great ;-)
Anyway, and to really conclude: have fun!
(Yes, optimizing is fun!)
(Ergh, I didn't think I would write that much... Hope at least some parts of this are useful... And I should remember this answer: might be useful some other times...)
Use XDebug and WinCacheGrind or WebCacheGrind to profile and analyze slow code execution.
(source: jokke.dk)
Profile code with XDebug.
Use a lot of caching. If your pages are relatively static, then reverse proxy might be the best way to do it.
Kohana is out of the box very very fast, except for the use of database objects. To quote Zombor "You can reduce memory usage by ensuring you are using the database result object instead of result arrays." This makes a HUGEE performance difference on a site that is being slammed. Not only does it use more memory, it slows down execution of scripts.
Also - you must use caching. I prefer memcache and use it in my models like this:
public function get($e_id)
{
$event_data = $this->cache->get('event_get_'.$e_id.Kohana::config('config.site_domain'));
if ($event_data === NULL)
{
$this->db_slave
->select('e_id,e_name')
->from('Events')
->where('e_id', $e_id);
$result = $this->db_slave->get();
$event_data = ($result->count() ==1)? $result->current() : FALSE;
$this->cache->set('event_get_'.$e_id.Kohana::config('config.site_domain'), $event_data, NULL, 300); // 5 minutes
}
return $event_data;
}
This will also dramatically increase performance. The above two techniques improved a sites performance by 80%.
If you gave some more information about where you think the bottleneck is, I'm sure we could give some better ideas.
Also check out yslow (google it) for some other performance tips.
Strictly related to Kohana (you probably already have done this, or not):
In production mode:
Enable internal caching (this will only cache the Kohana::find_file results, but this actually can help a lot.
Disable profiler
Just my 2 cents :)
I totally agree with the XDebug and caching answers. Don't look into the Kohana layer for optimization until you've identified your biggest speed and scale bottlenecks.
XDebug will tell you were you spend the most of your time and identify 'hotspots' in your code. Keep this profiling information so you can baseline and measure performance improvements.
Example problem and solution:
If you find that you're building up expensive objects from the database each time, that don't really change often, then you can look at caching them with memcached or another mechanism. All of these performance fixes take time and add complexity to your system, so be sure of your bottlenecks before you start fixing them.
Does anybody have experience working with PHP accelerators such as MMCache or Zend Accelerator? I'd like to know if using either of these makes PHP comparable to faster web-technologies. Also, are there trade offs for using these?
Note that Zend Optimizer and MMCache (or similar applications) are totally different things. While Zend Optimizer tries to optimize the program opcode MMCache will cache the scripts in memory and reuse the precompiled code.
I did some benchmarks some time ago and you can find the results in my blog (in German though). The basic results:
Zend Optimizer alone didn't help at all. Actually my scripts were slower than without optimizer.
When it comes to caches:
* fastest: eAccelerator
* XCache
* APC
And: You DO want to install a opcode cache!
For example:
alt text http://blogs.interdose.com/dominik/wp-content/uploads/2008/04/opcode_wordpress.png
This is the duration it took to call the wordpress homepage 10.000 times.
Edit: BTW, eAccelerator contains an optimizer itself.
MMCache has been deprecated. I recommend either http://pecl.php.net/package/APC or http://xcache.lighttpd.net/, both of which also give you variable storage (like Memcache).
Both are interesting and will provide speed boost since they compile source code into binary representation which is then executed by the PHP engine.
Any huge web site running with PHP (Facebook for example) is running some sort of opcode cache system like MMCache.
The problem is that they are not very easy to set up depending on your system.
Depending on how much of your PHP code is actually executed and how long that execution takes they can be a really big win. It certainly isn't going to hurt, but the gain you see will very much depend on where your time is currently spent.
btw mmcache has been rolled into a different project now, I forget the name but Google will tell you.
I use APC on my production servers and it works pretty well out of the box. Compile it and add it to PHP and there isn't much tweaking left to do for it. I check it every once in a while just to review stats but since I use MVC a lot all of the main files (routers, controllers, etc) rarely change on a day-to-day basis so that code stays compiled and runs pretty efficiently.
currently we use apc, free and was just a simple plug and play on our live servers. Provided a huge performance increase for our site, especially as the project size increased. I also have the apc.stat disabled so it doesn't check if the code has been updated, so whenever we need to update the code on the live site we restart apache.
I use APC, and can attest that it can dramatically reduce the CPU and I/O load on an app server if you maintain a high cache-hit rate. It not only saves you from having to compile, it can save you from having to read the php files from disk at all. (i.e. the bytecodes are served directly from main memory, so it's super fast) It lowers the speed to render a single page, and increases the requests per second your server can handle.
If you use RedHat or CentOS, installing APC is super simple:
yum install php-devel httpd-devel php-pear
pecl install apc
echo "extension=apc.so" > /etc/php.d/apc.ini
# if you're using SELinux:
chcon "system_u:object_r:textrel_shlib_t" /usr/lib/php/modules/apc.so
/etc/init.d/httpd restart
You asked about downsides. The only downside is that it requires some memory. The default on APC is 30MB, but it can be adjusted, and the cost of a little bit of memory more than pays for itself with the increased speed and response rate.
BlaM's testing included all the DB calls made by WordPress. When you're making fewer DB calls, you'll see the performance gain of opcode caches be even more dramatic.
I used Zend Accelerator a little back in the day (2004-ish). It certainly gave some significant performance wins on code it could work with, but unfortunately the system I was using was designed to quite often dynamically load code and then eval it, which Zend Accelerator couldn't do much with at the time (and I'd guess still can't).
On the down side, we certainly saw some caching issues (where the code would be changes, but the compiled version sync with the change for one reason or another). I imagine those problems have likely been ironed out by now.
Anyway, I don't have any hard comparison numbers, and certainly didn't write the same system in different environments for comparison, but for the vast majority of systems, PHP isn't going to kill you performance wise.
Have you checked out Phalanger? It compiles PHP to .NET code. Here are some benchmarks which show that it can dramatically improve performance.