Greetings everyone
I am working on a small crawling engine and am using cURL to request pages from various websites. My question is: what should I set my connection timeout and timeout values to? The pages I would normally be crawling contain lots of images and text.
cURL knows two different timeouts.
For CURLOPT_CONNECTTIMEOUT it doesn't matter how much text the site contains or how many other resources (like images) it references, because this is a connection timeout: even the server cannot know the size of the requested page until the connection is established.
For CURLOPT_TIMEOUT it does matter. Even large pages require only a few packets on the wire, but the server may need more time to assemble the output. Also the number of redirects and other things (e.g. proxies) can significantly increase response time.
Generally speaking, the "best value" for timeouts depends on your requirements and on the conditions of the networks and servers involved. Those conditions are subject to change, so there is no single "best value".
I recommend using rather short timeouts and retrying failed downloads later.
By the way, cURL does not automatically download resources referenced in the response. You have to do this manually with further calls to curl_exec (with fresh timeouts).
If you set it too high, your script will be slow: a single URL that is down will take the full CURLOPT_TIMEOUT to finish processing. If you are not using proxies, you can just set the following values:
CURLOPT_TIMEOUT = 3
CURLOPT_CONNECTTIMEOUT = 1
Then you can go through the failed URLs at a later time to double-check them.
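A minimal sketch of that approach (the URL list and the retry handling are illustrative, not part of the original answer):

$urls   = array('http://example.com/a', 'http://example.com/b'); // placeholder list
$failed = array();

foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 1); // give up connecting after 1 second
    curl_setopt($ch, CURLOPT_TIMEOUT, 3);        // give up on the whole transfer after 3 seconds
    $body = curl_exec($ch);

    if ($body === false) {
        $failed[] = $url; // double-check these in a later pass
    } else {
        // process $body here
    }
    curl_close($ch);
}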
The best response is rik's.
I have a proxy checker, and in my benchmarks I saw that most working proxies take less than 10 seconds to connect.
So I use 10 seconds for both the connection timeout and the timeout, but that's my case; you have to decide how much time you want to allow. Start with big values, use curl_getinfo to see the timing benchmarks, and then decrease the value.
Note: a proxy that takes more than 5 or 10 seconds to connect is useless for me; that's why I use those values.
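For reference, curl_getinfo() gives you those timing benchmarks; a sketch of how you might log them while tuning (the URL is a placeholder):

$ch = curl_init('http://example.com/'); // placeholder URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
curl_exec($ch);
$info = curl_getinfo($ch); // timing details for the last transfer
printf("connect: %.3fs, total: %.3fs\n", $info['connect_time'], $info['total_time']);
curl_close($ch);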
Yes. If your target is a proxy that queries another site, such a cascading connection will need a fairly long period, like these values, for the curl calls to complete.
Especially when you run into intermittent curl problems, check these values first.
I use
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 30); // allow up to 30 seconds to establish the connection
curl_setopt($ch, CURLOPT_TIMEOUT, 60);        // allow up to 60 seconds for the whole request
Related
I have a PHP website, and one of my pages makes a cURL call to another server. That server needs about 45 seconds to respond, and there is nothing I can do about it. There are actually two steps to get the information: the first is to send the request that updates the information (this takes about 43 seconds), and after that I need to send another request to get the data back (normally 2-5 seconds).
My server is on GoDaddy, and obviously it sometimes times out (CGI Timeout), because I think the limit is normally 30 seconds.
This script (sending the request + getting the data back) is normally triggered overnight via a cron job, but it can also be triggered during the day.
So I was wondering: what would be the best way to split the information to avoid timeout issues?
I was thinking of just sending the update request and not caring about the result. Then, about a minute later, I would send a request to get the data back. However, I have no idea whether it's even possible to set up a timer in PHP, and if so, would the page time out anyway?
Thanks!
You can set a timeout value in your PHP code to allow more time.
Setting Curl's Timeout in PHP
If you want to run the files separately, I would set up a separate cron job for the second file.
Use CURLOPT_CONNECTTIMEOUT to allow more time while connecting to the server.
CURLOPT_CONNECTTIMEOUT
The number of seconds to wait while trying to connect. Use 0 to wait indefinitely.
You should also set CURLOPT_TIMEOUT alongside CURLOPT_CONNECTTIMEOUT; it limits the total time the whole request may take.
Something like this:
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 0); // 0 means wait indefinitely for the connection; not a good practice
curl_setopt($ch, CURLOPT_TIMEOUT, 400);      // total limit for the request, in seconds
You can set it in milliseconds as well, using CURLOPT_TIMEOUT_MS (and CURLOPT_CONNECTTIMEOUT_MS for the connection phase).
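For example (the values are illustrative):

curl_setopt($ch, CURLOPT_CONNECTTIMEOUT_MS, 1500); // connection phase limited to 1.5 seconds
curl_setopt($ch, CURLOPT_TIMEOUT_MS, 400000);      // whole request limited to 400 seconds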
I am trying to code a crawler based on PHP with curl. I have database of 20,000-30,000 URLs that I have to crawl. Each call to curl to fetch a webpage takes around 4-5 seconds.
How can I optimize this and reduce the time required to fetch a page?
You can use the curl_multi_* functions for that. The number of curl handles you add to one multi handle is the number of parallel requests it will perform. I usually start with 20-30, depending on the size of the returned content (make sure your script won't terminate on the memory limit).
Note that a batch runs as long as its slowest request, so if a request times out you might wait a very long time. To avoid that, it's a good idea to set the timeout to some acceptable value.
You can see the code example at my answer in another thread here.
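For reference, a minimal (non-rolling) sketch along those lines, not the code from that linked answer; the URL list, batch size, and timeouts are illustrative:

$urls = array('http://example.com/1', 'http://example.com/2'); // in practice a batch of 20-30
$mh = curl_multi_init();
$handles = array();

foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
    curl_setopt($ch, CURLOPT_TIMEOUT, 15); // keep a sane timeout so one slow URL can't stall the batch
    curl_multi_add_handle($mh, $ch);
    $handles[$url] = $ch;
}

// run all handles in parallel until every transfer has finished or timed out
do {
    while (curl_multi_exec($mh, $running) == CURLM_CALL_MULTI_PERFORM);
    if ($running) {
        curl_multi_select($mh); // wait for activity instead of busy-looping
    }
} while ($running);

foreach ($handles as $url => $ch) {
    $body = curl_multi_getcontent($ch); // process or store $body here
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);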
I need to get some data from a remote HTTP server. I'm using curl classes for multi requests.
My problem is the remote server's firewall: I'm sending between 1,000 and 10,000 GET and POST requests, and the server bans me for DDoS.
I have tried these measures.
The requests contain header information:
curl_setopt($this->ch, CURLOPT_HTTPHEADER, $header);
The requests contain random referer information:
curl_setopt($this->ch, CURLOPT_REFERER, $refs[rand(0, count($refs) - 1)]); // pick a random referer (note the -1 to stay in bounds)
The requests contain random user agents:
curl_setopt($this->ch, CURLOPT_USERAGENT, $agents[rand(0, count($agents) - 1)]); // pick a random user agent (note the -1 to stay in bounds)
And I send the requests at random intervals using sleep():
sleep(rand(0,10));
But the server still bans my access for 1 hour each time.
Sorry for my bad english :)
Thanks for all.
Sending a large number of requests to the server in a short space of time is likely to have the same impact as a DoS attack, whether that is what you intended or not. A quick fix would be to change the sleep line from sleep(rand(0,10)); (which gives a 1 in 11 chance of sending the next request instantly) to sleep(3);, so there will always be roughly 3 seconds between requests, as in the sketch below. Three seconds should be enough of a gap to keep most servers happy. Once you've verified this works, you can reduce the value to 2 or 1 to see if you can speed things up.
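A sketch of that fixed-delay loop (the $urls list and the surrounding request code are placeholders for your own):

foreach ($urls as $url) {
    curl_setopt($this->ch, CURLOPT_URL, $url);
    curl_exec($this->ch); // send the request as before
    sleep(3);             // always wait about 3 seconds before the next request
}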
A far better solution would be to create an API on the server that allows you to get the data you need in 1, or at least only a few, requests. Obviously this is only possible if you're able to make changes to the server (or can persuade those who can to make the changes on your behalf).
Currently I'm using file_get_contents() to submit GET data to an array of sites, but upon execution of the page I get this error:
Fatal error: Maximum execution time of 30 seconds exceeded
All I really want the script to do is start loading the webpage, and then leave. Each webpage may take up to 5 minutes to load fully, and I don't need it to load fully.
Here is what I currently have:
foreach($sites as $s) //Create one line to read from a wide array
{
file_get_contents($s['url']); // Send to the shells
}
EDIT: To clear any confusion, this script is being used to start scripts on other servers, that return no data.
EDIT: I'm now attempting to use cURL to do the trick, by setting a timeout of one second to make it send the data and then stop. Here is my code:
$ch = curl_init($s['url']); //load the urls
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 1); //Only send the data, don't wait.
curl_exec($ch); //Execute
curl_close($ch); //Close it off.
Perhaps I've set the option wrong. I'm looking through some manuals as we speak. Just giving you an update. Thank you all of you that are helping me thus far.
EDIT: Ah, found the problem. I was using CURLOPT_CONNECTTIMEOUT instead of CURLOPT_TIMEOUT. Whoops.
However, now the scripts aren't triggering. They each use ignore_user_abort(TRUE); so I can't understand the problem.
Hah, scratch that. Works now. Thanks a lot everyone
There are many ways to solve this.
You could use cURL with its curl_multi_* functions to execute the requests asynchronously. Or use cURL the common way but with 1 second as the timeout limit, so the call will return with a timeout error while the request still gets made.
If you don't have cURL installed, you could keep using file_get_contents but fork processes (not so cool, but it works) using something like ZendX_Console_Process_Unix, so you avoid waiting between requests.
As Franco mentioned (and I'm not sure it was picked up on), you specifically want to use the curl_multi functions, not the regular curl ones. They pack multiple curl handles into a curl_multi handle and execute them simultaneously, returning (or not, in your case) the responses as they arrive.
Example at http://php.net/curl_multi_init
Re your update that you only need to trigger the operation:
You could try using file_get_contents with a timeout. This would lead to the remote script being called, but the connection being terminated after n seconds (e.g. 1).
If the remote script is configured so it continues to run even if the connection is aborted (in PHP that would be ignore_user_abort), it should work.
Try it out. If it doesn't work, you won't get around increasing your time_limit or using an external executable. But from what you're saying - you just need to make the request - this should work. You could even try to set the timeout to 0 but I wouldn't trust that.
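On the remote side, that would mean something like this at the top of the script (a sketch, assuming the heavy work happens in the same PHP script):

ignore_user_abort(true); // keep running after the caller disconnects at its 1-second timeout
set_time_limit(0);       // remove the remote script's own execution time limit
// ... long-running work goes here ...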
From here:
<?php
$ctx = stream_context_create(array(
    'http' => array(
        'timeout' => 1
    )
));
file_get_contents("http://example.com/", false, $ctx);
?>
To be fair, Chris's answer already includes this possibility: curl also has a timeout switch.
It is not file_get_contents() itself that consumes that much time, but the network connection.
Consider not submitting GET data to an array of sites; instead, create an RSS feed and let them pull the RSS data.
I don't fully understand the purpose of your script.
But here is what you can do:
In order to avoid the fatal error quickly you can just add set_time_limit(120) at the beginning of the file. This will allow the script to run for 2 minutes. Of course you can use any number that you want and 0 for infinite.
If you just need to call the URL and you don't "care" about the result, you should use cURL in an asynchronous way. In that case a call to the URL will not wait until it finishes, and you can fire them all off very quickly.
BR.
If the remote pages take up to 5 minutes to load, your file_get_contents will sit and wait for that 5 minutes. Is there any way you could modify the remote scripts to fork into a background process and do the heavy processing there? That way your initial hit will return almost immediately, and not have to wait for the startup period.
Another possibility is to investigate if a HEAD request would do the trick. HEAD does not return any data, just headers, so it may be enough to trigger the remote jobs and not wait for the full output.
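A sketch of the HEAD variant with cURL (the URL is a placeholder for the remote job):

$ch = curl_init('http://example.com/job.php'); // placeholder URL
curl_setopt($ch, CURLOPT_NOBODY, true);        // send a HEAD request, no response body
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 1);
curl_setopt($ch, CURLOPT_TIMEOUT, 2);
curl_exec($ch);
curl_close($ch);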
I'm using a 'rolling' cURL multi implementation (like this SO post, based on this cURL code). It works fine to process thousands of URLs using up to 100 requests at the same time, with 5 instances of the script running as daemons (yeah, I know, this should be written in C or something).
Here's the problem: after processing ~200,000 URLs (across the 5 instances), curl_multi_exec() seems to break for all instances of the script. I've tried shutting the scripts down and restarting, and the same thing happens (not after 200,000 URLs, but right on restart): the script hangs calling curl_multi_exec().
I put the script into 'single' mode, processing one regular cURL handle at a time, and that works fine (but it's not quite the speed I need). My logging leads me to suspect that it may have hit a patch of slow/problematic connections (since every so often it seems to process one URL and then hang again), but that would mean my CURLOPT_TIMEOUT is being ignored for the individual handles. Or maybe it's just something to do with running that many requests through cURL.
Anyone heard of anything like this?
Sample code (again based on this):
// Some logging shows it hangs right here, only looping a time or two,
// so the hang seems to be in the curl call.
while (($execrun = curl_multi_exec($master, $running)) == CURLM_CALL_MULTI_PERFORM);
//code to check for error or process whatever returned
I have CURLOPT_TIMEOUT set to 120, but in the cases where curl_multi_exec() finally returns some data, it's after 10 minutes of waiting.
I have a bunch of testing/checking yet to do, but thought maybe this might ring a bell with someone.
After much testing, I believe I've found what is causing this particular problem. I'm not saying the other answer is incorrect, just in this case not the issue I am having.
From what I can tell, curl_multi_exec() does not return until all DNS lookups (failure or success) have resolved. If there is a bunch of URLs with bad domains, curl_multi_exec() doesn't return for at least:
(time it takes to get resolve error) * (number of urls with bad domain)
Here's someone else who has discovered this:
Just a note on the asynchronous nature of cURL’s multi functions: the DNS lookups are not (as far as I know today) asynchronous. So if one DNS lookup of your group fails, everything in the list of URLs after that fails also. We actually update our hosts.conf (I think?) file on our server daily in order to get around this. It gets the IP addresses there instead of looking them up. I believe it’s being worked on, but not sure if it’s changed in cURL yet.
Also, testing shows that cURL (at least my version) does follow the CURLOPT_CONNECTTIMEOUT setting. Of course the first step of a multi cycle may still take a long time, since cURL waits for every url to resolve or timeout.
I think your problem is related to:
(62) CURLOPT_TIMEOUT does not work properly with the regular multi and multi_socket interfaces. The work-around for apps is to simply remove the easy handle once the time is up.
See also: http://curl.haxx.se/bug/view.cgi?id=2501457
If that is the case you should watch your curl handles for timeouts and remove them from the multi pool.
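A sketch of that work-around, tracking when each handle was added and dropping it once your own deadline passes (the $deadline value and the bookkeeping array are illustrative):

$deadline = 120;     // seconds allowed per handle
$active   = array(); // entries of array('handle' => $ch, 'started' => timestamp)

// when adding a handle:
curl_multi_add_handle($master, $ch);
$active[] = array('handle' => $ch, 'started' => time());

// inside the processing loop, after each curl_multi_exec() call:
foreach ($active as $i => $entry) {
    if (time() - $entry['started'] > $deadline) {
        curl_multi_remove_handle($master, $entry['handle']); // remove it so it can't stall the pool
        curl_close($entry['handle']);
        unset($active[$i]);
    }
}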