I am trying to code a crawler in PHP using cURL. I have a database of 20,000-30,000 URLs that I have to crawl, and each cURL call to fetch a webpage takes around 4-5 seconds.
How can I optimize this and reduce the time required to fetch a page?
You can use curl_multi_* for that. The number of curl handles you add to one multi handle is the number of parallel requests it will make. I usually start with 20-30 handles, depending on the size of the returned content (make sure your script won't hit the memory limit).
Note that the multi handle runs for as long as the slowest request takes, so if one request times out you might wait for a very long time. To avoid that, it is a good idea to set the timeout to some acceptable value.
You can see the code example at my answer in another thread here.
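For illustration, here is a minimal curl_multi sketch. It assumes $urls holds one batch of 20-30 URLs pulled from your database; the batch size and the 10-second timeout are just starting points, not hard rules.

<?php
$mh      = curl_multi_init();
$handles = array();

foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);           // cap slow requests
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_multi_add_handle($mh, $ch);
    $handles[$url] = $ch;
}

// Run all handles until every transfer has finished or timed out.
do {
    $status = curl_multi_exec($mh, $running);
    if ($running) {
        curl_multi_select($mh);                      // wait for activity instead of busy-looping
    }
} while ($running && $status === CURLM_OK);

foreach ($handles as $url => $ch) {
    $html = curl_multi_getcontent($ch);              // page body, or '' on failure
    // ... parse/store $html for $url here ...
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);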
I use this multi curl wrapper: https://github.com/php-curl-class/php-curl-class/
I'm looping through ~160 URLs and fetching XML data from them. As I understand it, the curl requests are done in parallel. The strange thing is that if I set a small timeout (say, 10 seconds), more than half of the URLs can't be handled: I receive an error callback with the message "Timeout was reached".
However, if I set the timeout to 100 seconds, almost all URLs are handled properly.
I cannot understand why this happens. If I use a single Curl instance and fetch data from any one of the URLs, I get a response pretty quickly; it doesn't take 100 seconds to fetch data from a single URL.
The purpose of multi curl is to do requests in parallel, and every request has its own timeout. So if the timeout is set to a small value (10-20-30 seconds), why does it turn out not to be enough?
Later I'll have ~600 URLs, which would mean the timeout should probably be increased to 400-500 seconds, which is weird. I might as well create a single Curl instance and do the requests one by one with almost the same result.
cURL in PHP cannot make a request asynchronous on its own.
Unlike fetch in JavaScript, you don't get promises that give you a second or third chance to chain more work onto the job. Ways of adding this kind of asynchronous behaviour to PHP with cURL have already been discussed in Async curl request in PHP and PHP Curl async response.
Initial Condition: I have code written in a PHP file. Initially it took 30 seconds to execute; in this file the code was called 5 times.
What will happen next: If I need to execute this code 50 times, it will take 300 seconds for one execution in the browser; for 500 times, 3000 seconds. So it is serial execution of the code.
What I Need: I need to execute this code in parallel, as several instances, so I can minimize the execution time and the user doesn't have to wait so long.
What I Did: I used PHP cURL to execute this code in parallel, calling this file several times to minimize the execution time.
So I want to know whether this method is correct, how many cURL requests I can execute, and how many resources they require. I need a better method for executing this code in parallel, ideally with a tutorial.
Any help will be appreciated.
Probably the simplest option without changing your code (too much), though, would be to call PHP through the command line rather than through cURL. This cuts out the overhead of Apache (both in memory and speed), networking, etc. Plus cURL is not a portable option, as some servers can't see themselves (in network terms).
$process1 = popen('php myfile.php [parameters]', 'r'); // 'r' = read the child's output
$process2 = popen('php myfile.php [parameters]', 'r');

// get response from children: you can loop until all have completed
$response1 = stream_get_contents($process1);
$response2 = stream_get_contents($process2);

pclose($process1);
pclose($process2);
You'll need to remove any reference to Apache-added variables in $_SERVER, and replace $_GET with argv/argc references. But otherwise it should just work.
But the best solution will probably be pThreads (http://php.net/manual/en/book.pthreads.php), which lets you do exactly what you want. It will require some editing of your code (and possibly installing the extension), but it does what you're asking.
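A minimal pThreads sketch of the idea, assuming a thread-safe (ZTS) PHP build with the pthreads extension installed; the class and property names here are illustrative, not part of any existing library:

<?php
class FetchWorker extends Thread
{
    public $url;
    public $response;

    public function __construct($url)
    {
        $this->url = $url;
    }

    public function run()
    {
        // Each thread performs its own blocking cURL request.
        $ch = curl_init($this->url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 30);
        $this->response = curl_exec($ch);
        curl_close($ch);
    }
}

$threads = array();
foreach (array('http://example.com/a', 'http://example.com/b') as $url) {
    $t = new FetchWorker($url);
    $t->start();
    $threads[] = $t;
}
foreach ($threads as $t) {
    $t->join();                 // wait for every thread to finish
    // $t->response now holds the fetched body (or false on failure)
}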
PHP's curl has low enough overhead that you don't have to worry about it. If you can make loopback calls to a server farm through a load balancer, that's a good use case for curl. I've also used pcntl_fork() for same-host parallelism, but it's harder to set up. I've written classes built on both; see my PHP lib at https://github.com/andrasq/quicklib for ideas (or just borrow code, it's open source).
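Not taken from quicklib, but for reference, a bare-bones pcntl_fork() sketch of same-host parallelism (requires the pcntl extension and a CLI context; the URLs are placeholders):

<?php
$urls = array('http://example.com/a', 'http://example.com/b'); // illustrative

$children = array();
foreach ($urls as $url) {
    $pid = pcntl_fork();
    if ($pid === -1) {
        die("fork failed\n");
    } elseif ($pid === 0) {
        // Child process: do the slow network call, then exit.
        file_get_contents($url);
        exit(0);
    }
    $children[] = $pid;          // parent keeps track of child PIDs
}

// Parent: wait for every child to finish.
foreach ($children as $pid) {
    pcntl_waitpid($pid, $status);
}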
Consider using Gearman. Documentation:
http://php.net/manual/en/book.gearman.php
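A rough sketch of how that could look, assuming a Gearman server running on its default host and port; the job name crawl_url is made up for the example:

<?php
// Worker (run in its own long-lived process): registers a job and waits for work.
$worker = new GearmanWorker();
$worker->addServer();                               // defaults to 127.0.0.1:4730
$worker->addFunction('crawl_url', function (GearmanJob $job) {
    $ch = curl_init($job->workload());              // the workload is the URL to fetch
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    $body = curl_exec($ch);
    curl_close($ch);
    return $body === false ? '' : $body;
});
while ($worker->work());

// Client (e.g. in your web request): queue jobs without waiting for them.
// $client = new GearmanClient();
// $client->addServer();
// $client->doBackground('crawl_url', 'http://example.com/page1');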
Do all Comet-style applications require a loop somewhere on the server side to detect updates/changes? If not, could you please explain how the logic behind a loopless Comet-style application would work?
This kind of application will always require a loop; you need to periodically check for new data, etc. Of course you can make the "loop" non-blocking by using an event-loop based approach, but in the end there's still a loop somewhere.
Just think about it for a moment: how would you make it work without a loop? I certainly can't imagine a way that doesn't use a loop somewhere.
The short answer is: no, not all of them require a loop on the server side.
Instead you can use long-polling AJAX calls from the browser to request data;
the server simply responds with the data, and the browser waits until the response is received before sending a new request.
The solution could be stream_set_blocking: use any available blocking resource so the OS suspends the process until the appropriate interruption arrives.
Client side:
Make an Ajax call to the endpoint script (with an Ajax timeout of e.g. 30 seconds); after 30 seconds either initiate another one, or you will get a response from the server because its execution time was reached.
If you get a response within the 30 seconds, handle it (asynchronously) and open a new connection, as Comet does (I saw this in the CometD client).
Server setup:
Set the Apache timeout (between the request and data being sent) to 30-31 seconds, so Apache will allow you to wait that long.
Set Apache to allow a lot of child processes (concurrent users × 1.5), but make sure you have enough memory for that many Apache instances (plus the memory used by the PHP children).
Script one:
Set the execution time to 28 seconds.
Register a shutdown function to send a response on timeout (formatted so your Ajax code can still understand it, if you need that).
Open a file, an empty one.
Enable blocking mode on the file stream using stream_set_blocking.
Try to read from the file: the script is suspended until another process writes to the file or the timeout is reached.
As soon as the script sees content written to the file by another process, it wakes up and sends the response (which triggers another Ajax call and another sleeping process); a rough sketch of this endpoint follows below.
The worst part is that you need to figure out how to get multiple reader scripts reading from the same bus (file) without disturbing each other.
It can also happen that the timeout hits exactly when a message is being written to the bus.
(I hope this solution is not as bad as my English.)
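A rough sketch of the server-side endpoint described above. Note that reading a plain file does not actually block at end-of-file, so this sketch substitutes a named pipe (FIFO) as the bus and uses stream_select for the timeout; the path /tmp/comet.bus and the 28-second wait are assumptions, and the FIFO has to be created once beforehand (e.g. with mkfifo).

<?php
header('Content-Type: application/json');

// Open the bus in 'r+' mode so the open itself never blocks and the
// stream does not hit EOF when a writer disconnects.
$bus = fopen('/tmp/comet.bus', 'r+');
stream_set_blocking($bus, true);

$read   = array($bus);
$write  = null;
$except = null;

// Sleep until another process writes to the bus, or give up after 28 seconds
// (stays under the ~30-second Ajax/Apache timeouts mentioned above).
if (stream_select($read, $write, $except, 28)) {
    $message = fread($bus, 8192);
    echo json_encode(array('status' => 'ok', 'data' => trim($message)));
} else {
    echo json_encode(array('status' => 'timeout'));
}
fclose($bus);

// A writer process could publish a message with something like:
// file_put_contents('/tmp/comet.bus', "new data\n");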
Currently I'm using file_get_contents() to submit GET data to an array of sites, but upon execution of the page I get this error:
Fatal error: Maximum execution time of 30 seconds exceeded
All I really want the script to do is start loading the webpage, and then leave. Each webpage may take up to 5 minutes to load fully, and I don't need it to load fully.
Here is what I currently have:
foreach($sites as $s) //Create one line to read from a wide array
{
file_get_contents($s['url']); // Send to the shells
}
EDIT: To clear any confusion, this script is being used to start scripts on other servers, that return no data.
EDIT: I'm now attempting to use cURL to do the trick, by setting a timeout of one second to make it send the data and then stop. Here is my code:
$ch = curl_init($s['url']); //load the urls
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 1); //Only send the data, don't wait.
curl_exec($ch); //Execute
curl_close($ch); //Close it off.
Perhaps I've set the option wrong. I'm looking through some manuals as we speak. Just giving you an update. Thank you to all of you who are helping me so far.
EDIT: Ah, found the problem. I was using CURLOPT_CONNECTTIMEOUT instead of CURLOPT_TIMEOUT. Whoops.
However, now the scripts aren't triggering. They each use ignore_user_abort(TRUE); so I can't understand the problem.
Hah, scratch that. It works now. Thanks a lot, everyone.
There are many ways to solve this.
You could use cURL with its curl_multi_* functions to execute the requests asynchronously. Or use cURL the common way but with a timeout limit of 1, so the call returns with a timeout while the request still gets executed on the remote end.
If you don't have cURL installed, you could keep using file_get_contents but fork processes (not so cool, but it works) using something like ZendX_Console_Process_Unix, so you avoid the waiting between requests.
As Franco mentioned and I'm not sure was picked up on, you specifically want to use the curl_multi functions, not the regular curl ones. This packs multiple curl objects into a curl_multi object and executes them simultaneously, returning (or not, in your case) the responses as they arrive.
Example at http://php.net/curl_multi_init
Re your update that you only need to trigger the operation:
You could try using file_get_contents with a timeout. This would lead to the remote script being called, but the connection being terminated after n seconds (e.g. 1).
If the remote script is configured so it continues to run even if the connection is aborted (in PHP that would be ignore_user_abort), it should work.
Try it out. If it doesn't work, you won't get around increasing your time_limit or using an external executable. But from what you're saying - you just need to make the request - this should work. You could even try to set the timeout to 0 but I wouldn't trust that.
From here:
<?php
// Abort the HTTP request after 1 second; the remote script keeps running
// as long as it calls ignore_user_abort(true) itself.
$ctx = stream_context_create(array(
    'http' => array(
        'timeout' => 1
    )
));
file_get_contents("http://example.com/", false, $ctx);
?>
To be fair, Chris's answer already includes this possibility: curl also has a timeout switch.
It is not file_get_contents() that consumes that much time but the network connection itself.
Consider not submitting GET data to an array of sites; instead, create an RSS feed and let them fetch the RSS data.
I don't fully understand the meaning behind your script.
But here is what you can do:
To avoid the fatal error quickly, you can just add set_time_limit(120) at the beginning of the file. This will allow the script to run for 2 minutes. Of course you can use any number you want, and 0 for infinite.
If you just need to call the URL and you don't "care" about the result, you should use cURL in asynchronous mode. That way a call to the URL will not wait until it has finished, and you can fire them all off very quickly.
BR.
If the remote pages take up to 5 minutes to load, your file_get_contents will sit and wait for that 5 minutes. Is there any way you could modify the remote scripts to fork into a background process and do the heavy processing there? That way your initial hit will return almost immediately, and not have to wait for the startup period.
Another possibility is to investigate if a HEAD request would do the trick. HEAD does not return any data, just headers, so it may be enough to trigger the remote jobs and not wait for the full output.
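For example, a sketch of the HEAD variant using cURL ($sites is the array from the question above; CURLOPT_NOBODY turns the request into a HEAD):

foreach ($sites as $s) {
    $ch = curl_init($s['url']);
    curl_setopt($ch, CURLOPT_NOBODY, true);          // send HEAD instead of GET
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // don't echo the response
    curl_setopt($ch, CURLOPT_TIMEOUT, 1);            // give up after one second
    curl_exec($ch);                                  // trigger the remote job
    curl_close($ch);
}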
Basically I need to get around the max execution time.
I need to scrape pages for info at varying intervals, which means calling the bot at those intervals to load a link from the database and scrape the page the link points to.
The problem is loading the bot. If I load it with JavaScript (like an Ajax call), the browser will throw up an error saying that the page is taking too long to respond, yadda yadda yadda, plus I will have to keep the page open.
If I do it from within PHP I could probably extend the execution time to however long is needed, but then if it does throw an error I don't have access to kill the process, and nothing is displayed in the browser until the PHP execution is completed, right?
I was wondering if anyone had any tricks to get around this: the scraper executing by itself at various intervals without me needing to watch it the whole time.
Cheers :)
Use set_time_limit() as such:
set_time_limit(0);
// Do Time Consuming Operations Here
"nothing is displayed in the browser until the PHP execute is completed"
You can use flush() to work around this:
flush()
(PHP 4, PHP 5)
Flushes the output buffers of PHP and whatever backend PHP is using (CGI, a web server, etc). This effectively tries to push all the output so far to the user's browser.
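For instance, a small sketch combining the two (set_time_limit plus flush); the $links array and the fetch inside the loop are placeholders for your own scraping code:

<?php
set_time_limit(0);                      // let the scraper run as long as it needs

// Drop any output buffers that would otherwise swallow the flush() calls.
while (ob_get_level() > 0) {
    ob_end_flush();
}

$links = array('http://example.com/a'); // placeholder: your links from the database

foreach ($links as $link) {
    $html = file_get_contents($link);   // placeholder for your scraping routine
    echo "Done: $link\n";
    flush();                            // push progress to the browser right away
}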
take a look at how Sphider (PHP Search Engine) does this.
Basically you will just process some part of the sites you need, do your thing, and go on to the next request if there's a continue=true parameter set.
Run it via CRON and split the spider into chunks, so it only does a few chunks at once; call it from CRON with different parameters to process only a few chunks per run.
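A minimal sketch of that chunked approach, assuming the script is invoked from CRON as, say, php spider.php 3 where 3 is the chunk number; the chunk size and the getUrlsForChunk() helper are made up for the example:

<?php
$chunk     = isset($argv[1]) ? (int) $argv[1] : 0;
$chunkSize = 50;

// Hypothetical helper: return up to $chunkSize URLs for this chunk, e.g.
// SELECT url FROM links LIMIT $chunkSize OFFSET ($chunk * $chunkSize).
function getUrlsForChunk($chunk, $chunkSize)
{
    return array();                     // replace with your database query
}

foreach (getUrlsForChunk($chunk, $chunkSize) as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    $html = curl_exec($ch);
    curl_close($ch);
    // ... scrape $html and store the results; the next CRON run handles the next chunk ...
}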