I have a small question about crawling a web page in PHP. I have to crawl about 90 000 products on one big eshop. I tried it in PHP, but one product takes about 2-3 sec and that's bad. Any tips, how to do it faster? Maybe a C++ multithread version? But what about time of a HTTP request? I mean, is it PHP's limitation or not? Thank you for the tips.
That's an extremely vague question. When you benchmarked the code you have, what was the slowest part? Was it network transfer times? Using a different language (or multiple threads) won't change that.
Was it time spent parsing the page? How are you doing that? If you're using an XML library to parse the entire DOM, could you get away with just looking for keywords (or even regular expressions)? That's less precise (and in some sense less correct) but perhaps it's faster.
What algorithms are you using for your analysis? Would other data structures provide better performance? As one simple example, if you spend a lot of time iterating over an array, perhaps a hash map is more appropriate.
PHP can be run in multiple processes. What happens if you kick off multiple instances of your script at once (on different pages)? Does the total time decrease?
Ultimately you've described a very general problem so I can't offer very specific solutions, but there is no inherent reason why PHP is inappropriate for this task. When you've identified what's slow (regardless of what language you're using) you should be able to more precisely address how to fix it.
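One rough way to answer the "what was the slowest part?" question, before reaching for a full profiler, is to wrap each stage of the script in a timer. This is only a minimal sketch: `timed()` and the stage labels are hypothetical names, and the two closures are placeholders standing in for your real download and parsing code.

```php
<?php
// Hypothetical helper: run $fn, add its wall-clock time to $timings[$label],
// and pass the callback's return value through unchanged.
function timed(string $label, callable $fn, array &$timings)
{
    $start = microtime(true);
    $result = $fn();
    $timings[$label] = ($timings[$label] ?? 0.0) + (microtime(true) - $start);
    return $result;
}

$timings = [];

// Placeholder stages; substitute the real HTTP request and the real parser.
$html = timed('fetch', function () {
    return '<html><title>Example</title></html>';
}, $timings);

$title = timed('parse', function () use ($html) {
    return preg_match('~<title>(.*?)</title>~s', $html, $m) ? $m[1] : null;
}, $timings);

arsort($timings); // slowest stage first
foreach ($timings as $label => $seconds) {
    printf("%-6s %.4f s\n", $label, $seconds);
}
```

If the 'fetch' stage dominates, a faster language will not help much; if 'parse' dominates, the parsing approach is the thing to look at.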
I don't think it's PHP's problem, but it could be, depending on connection speed/computer speed. I've never had a speed problem with PHP/cURL though.
Just use multiple threads (i.e. multiple connections at once). I suggest you use cURL, but that's only because I'm familiar with it.
Here's a guide I've used for multiple threads for scraping with cURL:
http://semlabs.co.uk/journal/object-oriented-curl-class-with-multi-threading
Be VERY careful not to accidentally cause a denial-of-service situation with your scripts. But I'm sure you're already aware of that possibility.
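For reference, the "multiple connections at once" idea can be sketched with PHP's own `curl_multi_*` functions (strictly speaking these are concurrent transfers inside one process, not threads). The function name `fetchAll`, the batch size and the pause between batches are assumptions for illustration, and the pause is there precisely so the script stays polite to the target server.

```php
<?php
// Fetch URLs in batches of $concurrency parallel transfers.
// Returns [url => body], with null for any URL that did not return HTTP 2xx.
function fetchAll(array $urls, int $concurrency = 5, int $pauseMs = 200): array
{
    $results = [];
    foreach (array_chunk($urls, max(1, $concurrency)) as $batch) {
        $multi = curl_multi_init();
        $handles = [];
        foreach ($batch as $url) {
            $ch = curl_init($url);
            curl_setopt_array($ch, [
                CURLOPT_RETURNTRANSFER => true, // keep the body instead of printing it
                CURLOPT_FOLLOWLOCATION => true,
                CURLOPT_TIMEOUT        => 30,
            ]);
            curl_multi_add_handle($multi, $ch);
            $handles[$url] = $ch;
        }

        // Drive every transfer in this batch until none is still running.
        do {
            $status = curl_multi_exec($multi, $running);
            if ($running > 0) {
                curl_multi_select($multi, 1.0); // wait for network activity
            }
        } while ($running > 0 && $status === CURLM_OK);

        foreach ($handles as $url => $ch) {
            $code = curl_getinfo($ch, CURLINFO_HTTP_CODE); // 0 if the transfer failed
            $results[$url] = ($code >= 200 && $code < 300)
                ? curl_multi_getcontent($ch)
                : null;
            curl_multi_remove_handle($multi, $ch);
        }

        usleep($pauseMs * 1000); // throttle between batches
    }
    return $results;
}
```

Called as `fetchAll($productUrls, 5)`, this keeps at most five requests in flight; raising the number shortens the total run only as long as the remote server keeps up.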
If your program is running slowly, my advice would be to run a profiler on it, and analyse why it's running slowly.
This advice applies to any language, but in the case of PHP, the profiler software you need is called xDebug.
This is a PHP extension, so you need to install it into your server. If you're running on an ISP's server, then you may not have permission to do this, but you can always install it with PHP on your local PC and run your tests there.
Once you've got xDebug installed, switch on the profiling features in php.ini (see the xDebug documentation for instructions on this), and run your program. It will then generate profiler files, which can be used to analyse what the program is doing.
Download KCacheGrind to perform the analysis. This will generate call tree information, showing exactly what happened as the program ran, and how long every function call took.
With this information, you can look for the function calls that are running slowly, and work out what's happening. Usually the reason for slow code is some kind of inefficiency in the way something is written; xDebug will help you find it.
Hope that helps.
There is a 99% probability that PHP is NOT the problem. It is more likely the eshop's webserver or some other network latency.
I know this for sure because I have been doing this for months now, and even if your code has lots of regular expressions, data scraping is really fast in PHP.
The solution to speed this up? Pre-cache the whole website with a command-line crawler, since disk space is cheap. curl can do this, and httrack as well. It will be much faster and more stable than having PHP do the crawling.
Then let PHP do the parsing alone. Hopefully you will see PHP chomping through dozens of pages per minute. Hope this helps :)
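The "parse from disk" half of that approach might look roughly like the sketch below. It assumes the crawler left the pages as `.html` files somewhere under one directory; the function name `extractTitles` is made up for illustration, and pulling out the `<title>` is only a stand-in for whatever product fields you actually need.

```php
<?php
// Walk a directory of already-downloaded pages and extract each page's <title>.
// Returns [file path => title or null].
function extractTitles(string $dir): array
{
    $titles = [];
    $files = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($dir, FilesystemIterator::SKIP_DOTS)
    );
    foreach ($files as $file) {
        if (!$file->isFile() || strtolower($file->getExtension()) !== 'html') {
            continue;
        }
        $html = file_get_contents($file->getPathname());
        if ($html === false || $html === '') {
            continue; // unreadable or empty file
        }

        // Real-world HTML is rarely valid, so collect parser warnings quietly.
        $previous = libxml_use_internal_errors(true);
        $doc = new DOMDocument();
        $doc->loadHTML($html);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);

        $nodes = $doc->getElementsByTagName('title');
        $titles[$file->getPathname()] = $nodes->length > 0
            ? trim($nodes->item(0)->textContent)
            : null;
    }
    return $titles;
}
```

Because no network is involved at this stage, the loop can be rerun as often as needed while you adjust the extraction logic.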
I'm trying to track down issues with an application [modx]. I have several of these sites [about 10] on my server and was wondering how I can see what PHP is doing.
Pages on these sites are extremely slow while the same sites in dev are fine as are other php applications on the server.
I tried using xdebug to get an idea of what PHP was doing while processing these pages and where the bottleneck was occurring, but it only appeared to want to do anything on an error [there are no errors being thrown].
Any suggestions on how to track this down?
[linux/Centos5/php5.2.?/apache2]
Xdebug and webgrind are a nice way to see where your bottlenecks are...
Read XDEBUG_PROFILE and Webgrind
Set up php.ini to have Xdebug profile your code on every run, or only when a special parameter is passed, then set up webgrind to read from the same directory Xdebug writes its profile dumps to.
Webgrind will show you what functions and set of functions require the most time, it breaks it down and makes it easy to find slow and/or inefficient code. (eg. your script is calling "PDOStatement->execute" 300 times on a fast query [Or calling it once and a massively slow one] taking up 90% of the execution time).
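If Xdebug seems to do nothing except on errors, it is worth checking whether the profiler is actually switched on. A small sketch, assuming the usual setting names (`xdebug.mode` and `xdebug.output_dir` in Xdebug 3; `xdebug.profiler_enable`, `xdebug.profiler_enable_trigger` and `xdebug.profiler_output_dir` in Xdebug 2); the function name is hypothetical, and it only reads the ini values visible to the running PHP:

```php
<?php
// Report whether Xdebug is loaded and whether its profiler is configured.
function xdebugProfilerStatus(): array
{
    if (!extension_loaded('xdebug')) {
        return ['loaded' => false, 'profiling' => false, 'output_dir' => null];
    }

    // Xdebug 3: comma-separated list of modes, e.g. "develop,profile".
    $modes = array_map('trim', explode(',', (string) ini_get('xdebug.mode')));
    $v3 = in_array('profile', $modes, true);

    // Xdebug 2: separate on/off switches (ini_get() returns false if unknown).
    $v2 = filter_var(ini_get('xdebug.profiler_enable'), FILTER_VALIDATE_BOOLEAN)
        || filter_var(ini_get('xdebug.profiler_enable_trigger'), FILTER_VALIDATE_BOOLEAN);

    $dir = ini_get('xdebug.output_dir')
        ?: (ini_get('xdebug.profiler_output_dir') ?: null);

    return ['loaded' => true, 'profiling' => $v3 || $v2, 'output_dir' => $dir];
}

var_dump(xdebugProfilerStatus());
```

Run it through the web server rather than the command line, since the two often load different php.ini files.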
The most commonly used tool, for finding bottlenecks in PHP, would be Xdebug. But you should also manually examine the codebase.
There are three different areas where you will have to focus on:
frontend performance
SQL queries
php logic itself
.. and the impact on the perceived speed is in this order.
You should start by running ySlow, and make sure that your site follows the guidelines as closely as possible.
The next step would be tracking down what SQL queries are executed, and (assuming you are using mysql) try to run them with EXPLAIN. Also, check the queries themselves. There might be some extremely stupid code there, like ORDER BY RAND() or use of LIKE in huge tables.
And the last stage would be taking a hard look at the code itself, on both the PHP and the JavaScript side of things.
Also, you should upgrade to PHP 5.3, because your version is extremely outdated.
Usually, when you don't know what you're looking for, you cannot spot it with tools like xdebug or the plugins/debug bars built into a CMS or framework. New Relic is the simplest solution: you'll be able to spot bottlenecks after a few minutes.
While New Relic is a paid app, you can test it for free for the first 14 days, which is more than enough to find the problem.
It's great because it integrates all the other tools and data sources you usually use:
xdebug, cpu & i/o monitoring, mysql slowlog, queries log.
It will also show you if your app is slow on php/DB/frontend/network.
You should try it out instead of wasting time debugging with other tools.
Here is a guide for CentOS installation: https://newrelic.com/docs/php/php-agent-installation-redhat-and-centos
How can I find out whether a PHP script goes bad and runs really slowly when run by hundreds of users every second, and better yet, is there any tool that could tell me approximately which part of the code slows me down?
...
I don't wish to post the code here (mainly because this question refers to something else and because it's a waste of space) and preferably never post it anywhere because it's actually a mess!... a mess that I understand and yes, i coded it, but still a mess which would insult anyone trying to comprehend it... so if you have any creative ideas, please let me know!
Cheers!
( thank you already for your incoming answers! )
Enable XDebug profiling, and send the resulting files through WinCacheGrind (Windows) or KCacheGrind (Linux).
This will allow you to see a breakdown of which functions get called most, and where the time is spent. Learning to use XDebug is a must for any serious PHP developer.
Here is a seemingly good tutorial on getting started with XDebug profiling.
You will need two tools
a profiler (Google it)
i use this one at work :
http://www.nusphere.com/products/php_profiler.htm (commercial)
a load tester
check this site for more info :
http://performance-testing.org/content/performance-testing-tools
I'd recommend to use a PHP profiler. Xdebug which is both PHP debugger and profiler can help a lot. There are also other debuggers, e.g. Zend Debugger.
To analyze profiling results you could also need a special tool. I used WinCacheGrind in Windows and KCachegrind in Linux.
Profiling report shows tons of useful information e.g. which lines of the source code were called how many times and which functions took the most of the execution time.
Sometimes Apache's CPU usage goes high on my server... recently, I saw something like 20 PHP processes running at the same time.
I need a tool to see what script each instance is running - something better than apache server-status.
A friend told me to use Zabbix.
I need to see what scripts are running... that's all!
I think it would be better to use a code tracer in PHP to see what causes a high load in Apache. Check Xdebug.org
The Apache log will tell you what URLs have been accessed, as well as other useful information such as the response time for each request.
If you're concerned that certain pages are running slowly, use an Apache log analysis tool to filter the hits by response time. This should give you a fairly clear idea of which page(s) are causing a problem for you. There are any number of tools available to analyse Apache logs, ranging from web-based Analytics tools aimed at tracking your visitor demographics to more technical analysis tools. I can't really recommend any one in particular, so I'll just suggest using google for this. You'll get plenty of results.
Once you know which pages to investigate, you should then try profiling the pages in question, to see which functions are causing the bottleneck. XDebug is the de-facto tool for this with PHP. (It is a full PHP debugger, complete with the ability to step through code line-by-line, and integration into most of the popular IDEs.)
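If you would rather not install a log analysis tool, the filtering by response time can be sketched in a few lines of PHP. Note the assumption: Apache's stock "common" and "combined" formats do not record response time, so this expects a `LogFormat` that ends with `%D` (time taken, in microseconds). The function name `slowestPaths` and the sample lines are made up for illustration.

```php
<?php
// Average response time per URL path, slowest first.
// Expects access-log lines whose last field is %D (microseconds).
function slowestPaths(iterable $lines, int $limit = 10): array
{
    $total = [];
    $hits = [];
    foreach ($lines as $line) {
        // Capture the request URI and the trailing numeric field.
        if (!preg_match('~"[A-Z]+ (\S+)[^"]*" \d{3} .*\s(\d+)\s*$~', $line, $m)) {
            continue; // not in the expected format
        }
        $path = parse_url($m[1], PHP_URL_PATH) ?: $m[1]; // drop the query string
        $total[$path] = ($total[$path] ?? 0) + (int) $m[2];
        $hits[$path] = ($hits[$path] ?? 0) + 1;
    }

    $averageMs = [];
    foreach ($total as $path => $microseconds) {
        $averageMs[$path] = $microseconds / $hits[$path] / 1000.0;
    }
    arsort($averageMs);
    return array_slice($averageMs, 0, $limit, true);
}

// Sample lines in "combined + %D" shape; a real run could pass
// new SplFileObject($pathToAccessLog) instead.
$sample = [
    '10.0.0.1 - - [10/Oct/2011:13:55:36 +0200] "GET /report.php?id=1 HTTP/1.1" 200 2326 "-" "Mozilla" 2000000',
    '10.0.0.2 - - [10/Oct/2011:13:55:40 +0200] "GET /report.php?id=2 HTTP/1.1" 200 2411 "-" "Mozilla" 1000000',
    '10.0.0.3 - - [10/Oct/2011:13:55:41 +0200] "GET /index.php HTTP/1.1" 200 512 "-" "Mozilla" 50000',
];
foreach (slowestPaths($sample) as $path => $ms) {
    printf("%-14s %8.1f ms\n", $path, $ms);
}
```

The paths at the top of that list are the ones worth profiling first.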
A site I built with Kohana was slammed with an enormous amount of traffic yesterday, causing me to take a step back and evaluate some of the design. I'm curious what are some standard techniques for optimizing Kohana-based applications?
I'm interested in benchmarking as well. Do I need to setup Benchmark::start() and Benchmark::stop() for each controller-method in order to see execution times for all pages, or am I able to apply benchmarking globally and quickly?
I will be using the Cache-library more in time to come, but I am open to more suggestions as I'm sure there's a lot I can do that I'm simply not aware of at the moment.
What I will say in this answer is not specific to Kohana, and can probably apply to lots of PHP projects.
Here are some points that come to my mind when talking about performance, scalability, PHP, ...
I've used many of those ideas while working on several projects -- and they helped; so they could probably help here too.
First of all, when it comes to performance, there are many aspects/questions to consider:
configuration of the server (both Apache, PHP, MySQL, other possible daemons, and system); you might get more help about that on ServerFault, I suppose,
PHP code,
Database queries,
Whether or not to use a reverse proxy in front of your webserver?
Can you use any kind of caching mechanism? Or do you always need fully up-to-date data on the website?
Using a reverse proxy
The first thing that could be really useful is using a reverse proxy, like varnish, in front of your webserver: let it cache as many things as possible, so only requests that really need PHP/MySQL calculations (and, of course, some other requests, when they are not in the cache of the proxy) make it to Apache/PHP/MySQL.
First of all, your CSS/Javascript/Images -- well, everything that is static -- probably don't need to be always served by Apache
So, you can have the reverse proxy cache all those.
Serving those static files is no big deal for Apache, but the less it has to work for those, the more it will be able to do with PHP.
Remember: Apache can only serve a finite, limited, number of requests at a time.
Then, have the reverse proxy serve as many PHP-pages as possible from cache: there are probably some pages that don't change that often, and could be served from cache. Instead of using some PHP-based cache, why not let another, lighter, server serve those (and fetch them from the PHP server from time to time, so they are always almost up to date)?
For instance, if you have some RSS feeds (We generally tend to forget those, when trying to optimize for performances) that are requested very often, having them in cache for a couple of minutes could save hundreds/thousands of request to Apache+PHP+MySQL!
Same for the most visited pages of your site, if they don't change for at least a couple of minutes (example: homepage?), then, no need to waste CPU re-generating them each time a user requests them.
Maybe there is a difference between pages served for anonymous users (the same page for all anonymous users) and pages served for identified users ("Hello Mr X, you have new messages", for instance)?
If so, you can probably configure the reverse proxy to cache the page that is served for anonymous users (based on a cookie, like the session cookie, typically)
It'll mean that Apache+PHP has less to deal with: only identified users -- which might be only a small part of your users.
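On the PHP side, the anonymous/identified split can be expressed through response headers. This is only a sketch under two assumptions: that the session cookie is what distinguishes identified users, and that the reverse proxy is configured to honour `Cache-Control` (in particular `s-maxage`), which depends on your proxy setup. The function name `cacheControlFor` is hypothetical.

```php
<?php
// Choose a Cache-Control value from the cookies sent with the request:
// anonymous pages may be cached by the proxy, personalised ones may not.
function cacheControlFor(array $cookies, string $sessionCookie = 'PHPSESSID', int $ttl = 120): string
{
    if (isset($cookies[$sessionCookie])) {
        // Identified user: the page is personalised, keep it out of shared caches.
        return 'private, no-cache';
    }
    // Anonymous user: let shared caches keep the page for $ttl seconds.
    return 'public, s-maxage=' . $ttl;
}

// Send the header early in the request (skipped when run from the command line).
if (PHP_SAPI !== 'cli' && !headers_sent()) {
    header('Cache-Control: ' . cacheControlFor($_COOKIE, session_name()));
}
```

A two-minute lifetime is an arbitrary starting point; pages that change rarely can tolerate much longer.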
About using a reverse-proxy as cache, for a PHP application, you can, for instance, take a look at Benchmark Results Show 400%-700% Increase In Server Capabilities with APC and Squid Cache.
(Yep, they are using Squid, and I was talking about varnish -- that's just another possibility ^^ Varnish being more recent, but more dedicated to caching)
If you do that well enough, and manage to stop re-generating too many pages again and again, maybe you won't even have to optimize any of your code ;-)
At least, maybe not in any kind of rush... And it's always better to perform optimizations when you are not under too much pressure...
As a sidenote: you are saying in the OP:
A site I built with Kohana was slammed with
an enormous amount of traffic yesterday,
This is the kind of sudden situation where a reverse-proxy can literally save the day, if your website can deal with not being up to date by the second:
install it, configure it, let it always -- every normal day -- run:
Configure it to not keep PHP pages in cache; or only for a short duration; this way, you always have up to date data displayed
And, the day you take a slashdot or digg effect:
Configure the reverse proxy to keep PHP pages in cache; or for a longer period of time; maybe your pages will not be up to date by the second, but it will allow your website to survive the digg-effect!
About that, How can I detect and survive being “Slashdotted”? might be an interesting read.
On the PHP side of things:
First of all: are you using a recent version of PHP? There are regularly improvements in speed, with new versions ;-)
For instance, take a look at Benchmark of PHP Branches 3.0 through 5.3-CVS.
Note that performance is quite a good reason to use PHP 5.3 (I've made some benchmarks (in French), and results are great)...
Another pretty good reason being, of course, that PHP 5.2 has reached its end of life, and is not maintained anymore!
Are you using any opcode cache?
I'm thinking about APC - Alternative PHP Cache, for instance (pecl, manual), which is the solution I've seen used the most -- and that is used on all servers on which I've worked.
See also: Slides APC Facebook,
Or Benchmark Results Show 400%-700% Increase In Server Capabilities with APC and Squid Cache.
It can really lower the CPU-load of a server a lot, in some cases (I've seen CPU-load on some servers go from 80% to 40%, just by installing APC and activating its opcode-cache functionality!)
Basically, execution of a PHP script goes in two steps:
Compilation of the PHP source-code to opcodes (kind of an equivalent of Java's bytecode)
Execution of those opcodes
APC keeps those in memory, so there is less work to be done each time a PHP script/file is executed: only fetch the opcodes from RAM, and execute them.
You might need to take a look at APC's configuration options, by the way
there are quite a few of those, and some can have a great impact on both speed / CPU-load / ease of use for you
For instance, disabling [apc.stat](https://php.net/manual/en/apc.configuration.php#ini.apc.stat) can be good for system-load; but it means modifications made to PHP files won't be taken into account unless you flush the whole opcode-cache; about that, for more details, see for instance To stat() Or Not To stat()?
Using cache for data
As much as possible, it is better to avoid doing the same thing over and over again.
The main thing I'm thinking about is, of course, SQL Queries: many of your pages probably do the same queries, and the results of some of those are probably almost always the same... Which means lots of "useless" queries made to the database, which has to spend time serving the same data over and over again.
Of course, this is true for other stuff, like Web Services calls, fetching information from other websites, heavy calculations, ...
It might be very interesting for you to identify:
Which queries are run lots of times, always returning the same data
Which other (heavy) calculations are done lots of times, always returning the same result
And store these data/results in some kind of cache, so they are easier to get -- faster -- and you don't have to go to your SQL server for "nothing".
Great caching mechanisms are, for instance:
APC: in addition to the opcode-cache I talked about earlier, it allows you to store data in memory,
And/or memcached (see also), which is very useful if you literally have lots of data and/or are using multiple servers, as it is distributed.
of course, you can think about files; and probably many other ideas.
I'm pretty sure your framework comes with some cache-related stuff; you probably already know that, as you said "I will be using the Cache-library more in time to come" in the OP ;-)
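To make the data-cache idea concrete, here is a minimal "compute once, reuse afterwards" helper. It is a sketch, not what any framework does: the name `remember` is hypothetical, it uses APCu (the user-data successor to APC's `apc_store`/`apc_fetch`) when that extension is enabled, and otherwise falls back to a plain per-request array so the example still runs.

```php
<?php
// Return the cached value for $key, or compute it, store it for $ttl seconds
// and return it.
function remember(string $key, int $ttl, callable $compute)
{
    static $local = []; // fallback cache, lives only for the current request
    $useApcu = function_exists('apcu_enabled') && apcu_enabled();

    if ($useApcu) {
        $value = apcu_fetch($key, $hit);
        if ($hit) {
            return $value;
        }
    } elseif (array_key_exists($key, $local)) {
        return $local[$key];
    }

    $value = $compute(); // the expensive part: SQL query, web service call, ...
    if ($useApcu) {
        apcu_store($key, $value, $ttl);
    } else {
        $local[$key] = $value;
    }
    return $value;
}

// Placeholder for an expensive query; $calls counts how often it really runs.
$calls = 0;
$loadCategories = function () use (&$calls) {
    $calls++;
    return ['books', 'music'];
};

$first  = remember('categories', 300, $loadCategories);
$second = remember('categories', 300, $loadCategories); // served from cache
```

The same shape works with memcached; only the fetch and store calls change.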
Profiling
Now, a nice thing to do would be to use the Xdebug extension to profile your application: it often allows to find a couple of weak-spots quite easily -- at least, if there is any function that takes lots of time.
Configured properly, it will generate profiling files that can be analysed with some graphic tools, such as:
KCachegrind: my favorite, but works only on Linux/KDE
Wincachegrind for windows; it does a bit less stuff than KCacheGrind, unfortunately -- it doesn't display callgraphs, typically.
Webgrind which runs on a PHP webserver, so works anywhere -- but probably has fewer features.
For instance, here are a couple screenshots of KCacheGrind:
(source: pascal-martin.fr)
(source: pascal-martin.fr)
(BTW, the callgraph presented on the second screenshot is typically something neither WinCacheGrind nor Webgrind can do, if I remember correctly ^^ )
(Thanks @Mikushi for the comment) Another possibility that I haven't used much is the xhprof extension: it also helps with profiling and can generate callgraphs -- but it is lighter than Xdebug, which means you should be able to install it on a production server.
You should be able to use it alongside XHGui, which will help for the visualisation of data.
On the SQL side of things:
Now that we've spoken a bit about PHP, note that it is more than possible that your bottleneck isn't the PHP-side of things, but the database one...
At least two or three things, here:
You should determine:
What are the most frequent queries your application is doing
Whether those are optimized (using the right indexes, mainly?), using the EXPLAIN instruction, if you are using MySQL
See also: Optimizing SELECT and Other Statements
You can, for instance, activate log_slow_queries to get a list of the requests that take "too much" time, and start your optimization with those.
whether you could cache some of these queries (see what I said earlier)
Is your MySQL well configured? I don't know much about that, but there are some configuration options that might have some impact.
Optimizing the MySQL Server might give you some interesting information about that.
Still, the two most important things are:
Don't go to the DB if you don't need to: cache as much as you can!
When you have to go to the DB, use efficient queries: use indexes; and profile!
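To find out which queries are the most frequent in the first place, one option is to count them from inside the application. The class below is a hypothetical sketch (`QueryStats` is not part of any library): it groups queries that differ only in their literal values and keeps a count and a total time for each group.

```php
<?php
// Hypothetical per-request query log: how often each query shape runs
// and how much time it takes in total.
final class QueryStats
{
    private $stats = [];

    public function record(string $sql, float $seconds): void
    {
        // Replace quoted strings and numbers with "?" and collapse whitespace,
        // so "WHERE id = 1" and "WHERE id = 2" count as the same shape.
        $shape = preg_replace(
            ["~'[^']*'~", '~\b\d+\b~', '~\s+~'],
            ['?', '?', ' '],
            trim($sql)
        );
        if (!isset($this->stats[$shape])) {
            $this->stats[$shape] = ['count' => 0, 'seconds' => 0.0];
        }
        $this->stats[$shape]['count']++;
        $this->stats[$shape]['seconds'] += $seconds;
    }

    // The $n query shapes with the largest total time, biggest first.
    public function top(int $n = 5): array
    {
        $sorted = $this->stats;
        uasort($sorted, function ($a, $b) {
            return $b['seconds'] <=> $a['seconds'];
        });
        return array_slice($sorted, 0, $n, true);
    }
}

// Usage: time each query where it is executed, then inspect the totals.
$stats = new QueryStats();
$sql = "SELECT e_id, e_name FROM Events WHERE e_id = 42";
$start = microtime(true);
// ... execute $sql with your database layer here ...
$stats->record($sql, microtime(true) - $start);
print_r($stats->top());
```

The shapes at the top of the list are the ones to run through EXPLAIN, or to cache.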
And what now?
If you are still reading, what else could be optimized?
Well, there is still room for improvements... A couple of architecture-oriented ideas might be:
Switch to an n-tier architecture:
Put MySQL on another server (2-tier: one for PHP; the other for MySQL)
Use several PHP servers (and load-balance the users between those)
Use another machine for static files, with a lighter webserver, like:
lighttpd
or nginx -- this one is becoming more and more popular, btw.
Use several servers for MySQL, several servers for PHP, and several reverse-proxies in front of those
Of course: install memcached daemons on whatever server has any amount of free RAM, and use them to cache as much as you can / makes sense.
Use something "more efficient" than Apache?
I hear more and more often about nginx, which is supposed to be great when it comes to PHP and high-volume websites; I've never used it myself, but you might find some interesting articles about it on the net;
for instance, PHP performance III -- Running nginx.
See also: PHP-FPM - FastCGI Process Manager, which is bundled with PHP >= 5.3.3, and does wonders with nginx.
Well, maybe some of those ideas are a bit overkill in your situation ^^
But, still... Why not study them a bit, just in case ? ;-)
And what about Kohana?
Your initial question was about optimizing an application that uses Kohana... Well, I've posted some ideas that are true for any PHP application... Which means they are true for Kohana too ;-)
(Even if not specific to it ^^)
I said: use cache; Kohana seems to support some caching stuff (You talked about it yourself, so nothing new here...)
If there is anything that can be done quickly, try it ;-)
I also said you shouldn't do anything that's not necessary; is there anything enabled by default in Kohana that you don't need?
Browsing the net, it seems there is at least something about XSS filtering; do you need that?
Still, here's a couple of links that might be useful:
Kohana General Discussion: Caching?
Community Support: Web Site Optimization: Maximum Website Performance using Kohana
Conclusion?
And, to conclude, a simple thought:
How much will it cost your company to pay you for 5 days? -- considering it is a reasonable amount of time to do some great optimizations
How much will it cost your company to buy (pay for?) a second server, and its maintenance?
What if you have to scale larger?
How much will it cost to spend 10 days? more? optimizing every possible bit of your application?
And how much for a couple more servers?
I'm not saying you shouldn't optimize: you definitely should!
But go for "quick" optimizations that will get you big rewards first: using some opcode cache might help you get between 10 and 50 percent off your server's CPU-load... And it takes only a couple of minutes to set up ;-) On the other side, spending 3 days for 2 percent...
Oh, and, btw: before doing anything: put some monitoring stuff in place, so you know what improvements have been made, and how!
Without monitoring, you will have no idea of the effect of what you did... Not even if it's a real optimization or not!
For instance, you could use something like RRDtool + cacti.
And showing your boss some nice graphics with a 40% CPU-load drop is always great ;-)
Anyway, and to really conclude: have fun!
(Yes, optimizing is fun!)
(Ergh, I didn't think I would write that much... Hope at least some parts of this are useful... And I should remember this answer: might be useful some other times...)
Use XDebug and WinCacheGrind or WebCacheGrind to profile and analyze slow code execution.
Profile code with XDebug.
Use a lot of caching. If your pages are relatively static, then reverse proxy might be the best way to do it.
Kohana is very, very fast out of the box, except for the use of database objects. To quote Zombor: "You can reduce memory usage by ensuring you are using the database result object instead of result arrays." This makes a HUGE performance difference on a site that is being slammed. Not only do result arrays use more memory, they also slow down execution of scripts.
Also - you must use caching. I prefer memcache and use it in my models like this:
    public function get($e_id)
    {
        // Build the cache key once; it is scoped by event id and site domain.
        $cache_key = 'event_get_'.$e_id.Kohana::config('config.site_domain');

        $event_data = $this->cache->get($cache_key);
        if ($event_data === NULL)
        {
            // Nothing cached: read the event from the slave database.
            $this->db_slave
                ->select('e_id,e_name')
                ->from('Events')
                ->where('e_id', $e_id);
            $result = $this->db_slave->get();
            // FALSE marks "no such event".
            $event_data = ($result->count() == 1) ? $result->current() : FALSE;
            $this->cache->set($cache_key, $event_data, NULL, 300); // 5 minutes
        }
        return $event_data;
    }
This will also dramatically increase performance. The above two techniques improved a site's performance by 80%.
If you gave some more information about where you think the bottleneck is, I'm sure we could give some better ideas.
Also check out yslow (google it) for some other performance tips.
Strictly related to Kohana (you probably already have done this, or not):
In production mode:
Enable internal caching (this will only cache the Kohana::find_file results, but this actually can help a lot).
Disable profiler
Just my 2 cents :)
I totally agree with the XDebug and caching answers. Don't look into the Kohana layer for optimization until you've identified your biggest speed and scale bottlenecks.
XDebug will tell you where you spend most of your time and identify 'hotspots' in your code. Keep this profiling information so you can baseline and measure performance improvements.
Example problem and solution:
If you find that you're building up expensive objects from the database each time, that don't really change often, then you can look at caching them with memcached or another mechanism. All of these performance fixes take time and add complexity to your system, so be sure of your bottlenecks before you start fixing them.