I'm trying to check whether a website exists (if it responds, that's good enough). The issue is that my array holds 20,000 domains, and I'm trying to speed up the process as much as possible.
I've done some research and come across this page which details simultaneous cURL requests ->
http://www.phpied.com/simultaneuos-http-requests-in-php-with-curl/
I also found this page, which seems to be a good way of checking whether a domain's webpage is up -> http://www.wrichards.com/blog/2009/05/php-check-if-a-url-exists-with-curl/
Any ideas on how to quickly check 20,000 domains to see if they are up?
$http = curl_init($url);
curl_setopt($http, CURLOPT_RETURNTRANSFER, true); // capture the body instead of printing it
$result = curl_exec($http);
$http_status = curl_getinfo($http, CURLINFO_HTTP_CODE);
curl_close($http);
if ($http_status == 200) {
    // good here
}
Check out RollingCurl.
It allows you to execute multiple curl requests.
Here is an example:
require 'curl/RollingCurl.php';
require 'curl/RollingCurlGroup.php';
$rc = new RollingCurl('handle_response');
$rc->window_size = 2;
foreach ($domain_array as $domain => $value)
{
    $request = new RollingCurlRequest($value);
    // echo $temp . "\n";
    $rc->add($request);
}
$rc->execute();
function handle_response($response, $info)
{
    if ($info['http_code'] === 200)
    {
        // site exists handle response data
    }
}
If you really want to speed up the process and save a lot of bandwidth (as I understand it, you plan to check availability on a regular basis), then you should work with sockets rather than curl. You can open several sockets at a time and handle each one asynchronously. Instead of a GET request, send a HEAD request, for example "HEAD / HTTP/1.0\r\nHost: $sitename\r\n\r\n". It returns the same status code a GET request would, but without the response body. You only need to parse the first line of the response, so you can simply preg_match it against the good response codes. As one extra optimization, your code will eventually learn which sites share the same IP, so you can cache the name mappings and order the list by IP. Then you can check several sites over one connected socket (remember to add the 'Connection: keep-alive' header).
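A minimal sketch of the request and parsing steps described above. It covers only one blocking socket, not the asynchronous handling of many sockets or the per-IP connection reuse. The function names (build_head_request, parse_status_code, head_status), port 80 and the 5-second timeout are assumptions for illustration.

```php
<?php
// Build a minimal HEAD request for the root path of $host.
function build_head_request(string $host): string
{
    return "HEAD / HTTP/1.0\r\n"
         . "Host: {$host}\r\n"
         . "Connection: close\r\n\r\n";
}

// Extract the status code from the first response line, or null if it is not an HTTP status line.
function parse_status_code(string $statusLine): ?int
{
    if (preg_match('#^HTTP/\d(?:\.\d)?\s+(\d{3})\b#', $statusLine, $m)) {
        return (int) $m[1];
    }
    return null;
}

// Send the request over a plain socket and read only the first line of the reply.
function head_status(string $host, float $timeout = 5.0): ?int
{
    $fp = @stream_socket_client("tcp://{$host}:80", $errno, $errstr, $timeout);
    if ($fp === false) {
        return null; // DNS failure, refused connection, or connect timeout
    }
    stream_set_timeout($fp, (int) ceil($timeout)); // also bound the read
    fwrite($fp, build_head_request($host));
    $line = fgets($fp, 1024);
    fclose($fp);
    return $line === false ? null : parse_status_code($line);
}
```

Calling head_status('example.com') would return the integer status code, or null when no HTTP status line was received.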
You can use multi curl requests, but you probably want to limit them to 10 or so at a time. You would have to track jobs in a separate database for processing the queue: Threads in PHP
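One way to cap concurrency at roughly 10 is to split the list into batches and run each batch through curl_multi. This is a rough sketch rather than a tuned implementation: the name check_urls, the HEAD-only requests and the timeout values are assumptions, and duplicate URLs in the input would overwrite each other in the result array.

```php
<?php
// Check URLs in batches of $window using curl_multi; returns [url => HTTP code] (0 = no response).
function check_urls(array $urls, int $window = 10): array
{
    $codes = [];
    foreach (array_chunk($urls, $window) as $batch) {
        $mh = curl_multi_init();
        $handles = [];
        foreach ($batch as $url) {
            $ch = curl_init($url);
            curl_setopt_array($ch, [
                CURLOPT_NOBODY         => true, // HEAD request, no body download
                CURLOPT_RETURNTRANSFER => true,
                CURLOPT_CONNECTTIMEOUT => 5,    // assumed value
                CURLOPT_TIMEOUT        => 10,   // assumed value
            ]);
            curl_multi_add_handle($mh, $ch);
            $handles[$url] = $ch;
        }
        // Drive all transfers in this batch until none is still running.
        do {
            $status = curl_multi_exec($mh, $running);
            if ($running) {
                curl_multi_select($mh, 1.0); // wait for socket activity
            }
        } while ($running && $status === CURLM_OK);
        foreach ($handles as $url => $ch) {
            $codes[$url] = curl_getinfo($ch, CURLINFO_HTTP_CODE);
            curl_multi_remove_handle($mh, $ch);
            curl_close($ch);
        }
        curl_multi_close($mh);
    }
    return $codes;
}
```

A batch only finishes when its slowest request does, so a rolling window (as RollingCurl provides) keeps the pipeline fuller; the batch form is just the shortest way to show the limit.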
I am trying to get the meta tags and description of a page from a given URL.
I have an array of URLs that I loop through, sending a cURL GET request for each one to fetch the page's meta data. This takes a lot of time to process.
Is there any way to process all the URLs simultaneously?
I mean sending the requests to all URLs at the same time and then receiving
each response as soon as its request completes.
For this purpose I have used
curl_multi_init()
but it's not working as expected. I have used this example:
Simultaneuos HTTP requests in PHP with cURL
I have also used GuzzleHttp example
Concurrent HTTP requests without opening too many connections
my code
$urlData = [
'http://youtube.com',
'http://dailymotion.com',
'http://php.net'
];
foreach ($urlData as $url) {
    $promises[] = $this->client->requestAsync('GET', $url);
}
Promise\all($promises)->then(function (array $responses) {
    foreach ($responses as $response) {
        $htmlData = $response->getBody();
        dump($htmlData);
    }
})->wait();
But I got this error
Call to undefined function GuzzleHttp\Promise\Promise\all()
I am using Guzzle 6 and Promises 1.3
I need a solution, whether in curl or in Guzzle, for sending simultaneous requests to save time.
Check your use statements. You probably have a mistake there, because the correct name is GuzzleHttp\Promise\all(). Maybe you forgot use GuzzleHttp\Promise as Promise.
Otherwise the code is correct and should work. Also check that you have the cURL extension enabled in PHP, so that Guzzle will use it as the backend. It's probably there already, but it's worth checking ;)
I want to have an HTTP GET request sent from PHP. Example:
http://tracker.example.com?product_number=5230&price=123.52
The idea is to do server-side web-analytics: Instead of sending tracking
information from JavaScript to a server, the server sends tracking
information directly to another server.
Requirements:
The request should take as little time as possible, in order to not
noticeably delay processing of the PHP page.
The response from the tracker.example.com does not need to be
checked. As examples, some possible responses from
tracker.example.com:
200: That's fine, but no need to check that.
404: Bad luck, but - again - no need to check that.
301: Although a redirect would be appropriate, it would delay
processing of the PHP page, so don't do that.
In short: All responses can be discarded.
Ideas for solutions:
In a now deleted answer, someone suggested calling command line
curl from PHP in a shell process. This seems like a good idea,
only that I don't know if forking a lot of shell processes under
heavy load is a wise thing to do.
I found php-ga, a package for doing server-side Google
Analytics from PHP. On the project's page, it is
mentioned: "Can be configured to [...] use non-blocking requests."
So far I haven't found the time to investigate what method php-ga
uses internally, but this method could be it!
In a nutshell: What is the best solution for doing generic server-side
tracking/analytics from PHP?
Unfortunately, PHP is by definition blocking. While this holds true for the majority of functions and operations you will normally be handling, the current scenario is different.
The process, which I like to call HTTP-ping, requires only that you touch a specific URI, forcing the target server to bootstrap its internal logic. Some functions let you achieve something very similar to this HTTP-ping by not waiting for a response.
Note that pinging a URL is a two-step process:
Resolve the DNS
Making the request
While making the request should be rather fast once the DNS is resolved and the connection is made, there aren't many ways of making the DNS resolve faster.
Some ways of doing an http-ping are:
cURL, by setting CURLOPT_CONNECTTIMEOUT to a low value
fsockopen by closing immediately after writing
stream_socket_client (same as fsockopen) and also adding STREAM_CLIENT_ASYNC_CONNECT
Both cURL and fsockopen block while the DNS is being resolved, but I have noticed that fsockopen is significantly faster, even in worst-case scenarios.
stream_socket_client, on the other hand, should fix the DNS-resolution problem and should be the optimal solution in this scenario, but I have not managed to get it to work.
One final solution is to start another thread or process that does this for you. Making a system call should work, and so should forking the current process. Unfortunately, neither is really safe in applications where you can't control the environment PHP is running on.
System calls are more often than not blocked, and pcntl is not enabled by default.
I would call tracker.example.com this way:
get_headers('http://tracker.example.com?product_number=5230&price=123.52');
and in the tracker script:
ob_end_clean();
ignore_user_abort(true);
ob_start();
header("Connection: close");
header("Content-Length: " . ob_get_length());
ob_end_flush();
flush();
// from here the response has been sent. you can now wait as long as you want and do some tracking stuff
sleep(5); //wait 5 seconds
do_some_stuff();
exit;
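I implemented a function that sends a fast GET request to a URL without waiting for the response:
function fast_request($url)
{
    $parts = parse_url($url);
    $port = isset($parts['port']) ? $parts['port'] : 80;
    $fp = fsockopen($parts['host'], $port, $errno, $errstr, 30);
    if (!$fp) {
        return false; // connection failed; $errstr holds the reason
    }
    // Default to "/" when the URL has no path, and keep the query string.
    $path = isset($parts['path']) ? $parts['path'] : '/';
    if (isset($parts['query'])) {
        $path .= '?' . $parts['query'];
    }
    $out  = "GET " . $path . " HTTP/1.1\r\n";
    $out .= "Host: " . $parts['host'] . "\r\n";
    $out .= "Content-Length: 0" . "\r\n";
    $out .= "Connection: Close\r\n\r\n";
    fwrite($fp, $out);
    fclose($fp);
    return true;
}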
We were using the fsockopen and fwrite combo, then one day it stopped working, or at least became intermittent. After a little research and testing, and provided you have fopen wrappers enabled, I ended up using file_get_contents and stream_context_create with a timeout set to a hundredth of a second. The timeout parameter accepts floating-point values (https://www.php.net/manual/en/context.http.php). I wrapped it in a try...catch block so it would fail silently. It works beautifully for our purposes. You can do logging in the catch if needed. The timeout is the key if you don't want the function to block runtime.
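function fetchWithoutResponseURL($url)
{
    $context = stream_context_create([
        "http" => [
            "method"  => "GET",
            "timeout" => .01, // seconds; fractional values are allowed
        ],
    ]);
    try {
        // file_get_contents() reports a timeout as a warning, not an exception, so suppress it with @.
        @file_get_contents($url, false, $context);
    } catch (Exception $e) {
        // Fail silently
    }
}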
For those of you working with WordPress as a backend,
it is as simple as:
wp_remote_get( $url, array( 'blocking' => false ) );
Came here whilst researching a similar problem. If you have a database connection handy, one other possibility is to quickly stuff the request details into a table, and then have a separate cron-based process that periodically scans that table for new records and makes the tracking requests. This frees your web application from having to make the HTTP request itself.
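A rough sketch of that queue-table idea using PDO. The table name tracking_queue, the function names and the SQLite flavour of the CREATE TABLE statement are all assumptions; with MySQL the schema syntax would differ slightly, and $send stands in for whatever actually performs the HTTP request.

```php
<?php
// One-time setup (SQLite syntax, assumed for illustration).
function create_queue_table(PDO $db): void
{
    $db->exec('CREATE TABLE IF NOT EXISTS tracking_queue (
        id  INTEGER PRIMARY KEY AUTOINCREMENT,
        url TEXT NOT NULL
    )');
}

// Web request side: record the tracking call instead of performing it.
function enqueue_tracking(PDO $db, string $url): void
{
    $stmt = $db->prepare('INSERT INTO tracking_queue (url) VALUES (?)');
    $stmt->execute([$url]);
}

// Cron side: send every pending row, deleting each one after it has been handed to $send.
function process_queue(PDO $db, callable $send): int
{
    $rows = $db->query('SELECT id, url FROM tracking_queue ORDER BY id')
               ->fetchAll(PDO::FETCH_ASSOC);
    $delete = $db->prepare('DELETE FROM tracking_queue WHERE id = ?');
    foreach ($rows as $row) {
        $send($row['url']);
        $delete->execute([$row['id']]);
    }
    return count($rows); // number of requests processed in this run
}
```

The page request then costs a single INSERT, and the cron script can take as long as it needs to make the real requests.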
You can use shell_exec, and command line curl.
For an example, see this question
You can actually do this using cURL directly.
I have implemented it both with a very short timeout (CURLOPT_TIMEOUT_MS) and with curl_multi_exec.
Be advised: I eventually abandoned this method because not every request was made correctly. This could have been caused by my own server, though I haven't been able to rule out cURL failing.
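For reference, the short-timeout variant might look roughly like this. The function name fire_and_forget and the 100 ms default are assumptions. CURLOPT_NOSIGNAL is set because sub-second timeouts are known to misbehave on some libcurl builds without it.

```php
<?php
// Send a GET and give up after $timeoutMs milliseconds.
// Returns true only if a complete response arrived in time.
function fire_and_forget(string $url, int $timeoutMs = 100): bool
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,       // keep the body out of the page output
        CURLOPT_NOSIGNAL       => true,       // avoid signal-based timeouts for sub-second values
        CURLOPT_TIMEOUT_MS     => $timeoutMs, // total time allowed for the whole request
    ]);
    $ok = curl_exec($ch) !== false;
    curl_close($ch);
    return $ok;
}
```

If the timeout expires before the connection is established and the request is written, the tracker never sees the request. That is one possible explanation for the missed requests mentioned above, though it is only a guess.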
I needed to do something similar: just ping a URL and discard all responses. I used proc_open to start the process and proc_close to close it; note that proc_close waits for the process to terminate before returning. I'm assuming you have lynx installed on your server:
<?php
function ping($url) {
    // Escape the URL so it cannot inject shell commands.
    $proc = proc_open("lynx " . escapeshellarg($url), [], $pipes);
    proc_close($proc);
}
?>
<?php
// Create a stream
$opts = array(
'http'=>array(
'method'=>"GET",
'header'=>"Accept-language: en"
)
);
$context = stream_context_create($opts);
// Open the file using the HTTP headers set above
$file = file_get_contents('http://tracker.example.com?product_number=5230&price=123.52', false, $context);
?>
I'm trying to make a PHP script that will check the HTTP status of a website as fast as possible.
I'm currently using get_headers() and running it in a loop of 200 random urls from mysql database.
To check all 200 - it takes an average of 2m 48s.
Is there anything I can do to make it (much) faster?
(I know about fsockopen. It can check port 80 on 200 sites in 20s, but that's not the same as requesting the HTTP status code, because the server may be responding on the port but not loading websites correctly, etc.)
Here is the code..
<?php
function get_httpcode($url) {
$headers = get_headers($url, 0);
// Return http status code
return substr($headers[0], 9, 3);
}
###
## Grab task and execute it
###
// Loop through task
while($data = mysql_fetch_assoc($sql)):
$result = get_httpcode('http://'.$data['url']);
echo $data['url'].' = '.$result.'<br/>';
endwhile;
?>
You can try the cURL library. You can send multiple requests in parallel with curl_multi_exec.
Example:
$ch = curl_init('http_url');
curl_setopt($ch, CURLOPT_HEADER, 1);
$c = curl_exec($ch);
$info = curl_getinfo($ch, CURLINFO_HTTP_CODE);
print_r($info);
Updated:
Look at this example: http://www.codediesel.com/php/parallel-curl-execution/
I don't know if this is an option you can consider, but you could run all of them at almost the same time using a fork. That way the script will take only a bit longer than a single request.
http://www.php.net/manual/en/function.pcntl-fork.php
You could put this in a script that runs in CLI mode and launches all the requests at the same time.
Edit: you say that you have 200 calls to make, so one thing you might run into is loss of the database connection. The problem is that the link is destroyed when the first script completes. To avoid that, you could create a new connection for each child. I see that you are using the standard mysql_* functions, so be sure to pass the 4th parameter (new_link) to mysql_connect so that a new link is created each time. Also check the maximum number of simultaneous connections on your server.
My apologies, I've actually asked this question multiple times, but never quite understood the answers.
Here is my current code:
while($resultSet = mysql_fetch_array($SQL)){
$ch = curl_init($resultSet['url'] . $fullcurl); //load the urls and send GET data
curl_setopt($ch, CURLOPT_TIMEOUT, 2); //Only load it for two seconds (Long enough to send the data)
curl_exec($ch); //Execute the cURL
curl_close($ch); //Close it off
} //end while loop
What I'm doing here is taking URLs from a MySQL database ($resultSet['url']), appending some extra variables to them, just some GET data ($fullcurl), and simply requesting the pages. This starts the script running on those pages, and that's all this script needs to do: start those scripts. It doesn't need to return any output. It just needs to load each page long enough for the script to start.
However, currently it's loading each URL (currently 11) one at a time. I need to load all of them simultaneously. I understand I need to use curl_multi_, but I haven't the slightest idea on how cURL functions work, so I don't know how to change my code to use curl_multi_ in a while loop.
So my questions are:
How can I change this code to load all of the URLs simultaneously? Please explain it and not just give me code. I want to know what each individual function does exactly. Will curl_multi_exec even work in a while loop, since the while loop is just sending each row one at a time?
And of course, any references, guides, or tutorials about cURL functions would be nice as well. Preferably not so much from php.net: while it does a good job of giving me the syntax, it's just a little dry and not so good with the explanations.
EDIT: Okay zaf, here is my current code as of now:
$mh = curl_multi_init(); //set up a cURL multiple execution handle
$SQL = mysql_query("SELECT url FROM urls") or die(mysql_error()); //Query the shell table
while($resultSet = mysql_fetch_array($SQL)){
$ch = curl_init($resultSet['url'] . $fullcurl); //load the urls and send GET data
curl_setopt($ch, CURLOPT_TIMEOUT, 2); //Only load it for two seconds (Long enough to send the data)
curl_multi_add_handle($mh, $ch);
} //No more shells, close the while loop
curl_multi_exec($mh); //Execute the multi execution
curl_multi_close($mh); //Close it when it's finished.
In your while loop, you need to do the following for each URL:
create a curl resource by using curl_init()
set options for resource by curl_setopt(..)
Then you need to create a multiple curl handle by using curl_multi_init() and adding all the previous individual curl resources by using curl_multi_add_handle(...)
Then finally you can do curl_multi_exec(...).
A good example can be found here: http://us.php.net/manual/en/function.curl-multi-exec.php
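Since the question asks what each call does, here is a commented sketch of those steps with the database loop replaced by a plain array of URLs. The function name request_all and the return value (a count of transfers that finished without a transport error) are assumptions for illustration. The key point is that curl_multi_exec takes a second by-reference argument and has to be called repeatedly until nothing is still running; a single call only starts the transfers.

```php
<?php
// Request every URL concurrently; returns how many transfers finished without a transport error.
function request_all(array $urls, int $timeoutSeconds = 2): int
{
    if (!$urls) {
        return 0;
    }
    $mh = curl_multi_init();                 // container that drives many easy handles at once
    foreach ($urls as $url) {
        $ch = curl_init($url);               // one easy handle per URL
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeoutSeconds);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // do not echo response bodies
        curl_multi_add_handle($mh, $ch);     // register it; nothing is sent yet
    }

    $ok = 0;
    do {
        // Each call advances all transfers a little; $running is the number still in progress.
        $status = curl_multi_exec($mh, $running);
        // Collect the handles that have finished since the last pass.
        while ($done = curl_multi_info_read($mh)) {
            if ($done['result'] === CURLE_OK) {
                $ok++;
            }
            curl_multi_remove_handle($mh, $done['handle']);
            curl_close($done['handle']);
        }
        if ($running) {
            curl_multi_select($mh, 0.5);     // sleep until there is socket activity
        }
    } while ($running && $status === CURLM_OK);

    curl_multi_close($mh);
    return $ok;
}
```

The while loop over database rows only builds the list of handles; the concurrency comes from the do/while loop afterwards, which services all of them together.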
Given a list of urls, I would like to check that each url:
Returns a 200 OK status code
Returns a response within X amount of time
The end goal is a system that is capable of flagging urls as potentially broken so that an administrator can review them.
The script will be written in PHP and will most likely run on a daily basis via cron.
The script will be processing approximately 1000 urls at a go.
Question has two parts:
Are there any bigtime gotchas with an operation like this, what issues have you run into?
What is the best method for checking the status of a url in PHP considering both accuracy and performance?
Use the PHP cURL extension. Unlike fopen() it can also make HTTP HEAD requests, which are sufficient to check the availability of a URL and save you a ton of bandwidth, as you don't have to download the entire body of the page to check.
As a starting point you could use some function like this:
function is_available($url, $timeout = 30) {
$ch = curl_init(); // get cURL handle
// set cURL options
$opts = array(CURLOPT_RETURNTRANSFER => true, // do not output to browser
CURLOPT_URL => $url, // set URL
CURLOPT_NOBODY => true, // do a HEAD request only
CURLOPT_TIMEOUT => $timeout); // set timeout
curl_setopt_array($ch, $opts);
curl_exec($ch); // do it!
$retval = curl_getinfo($ch, CURLINFO_HTTP_CODE) == 200; // check if HTTP OK
curl_close($ch); // close handle
return $retval;
}
However, there's a ton of possible optimizations: You might want to re-use the cURL instance and, if checking more than one URL per host, even re-use the connection.
Oh, and this code checks strictly for HTTP response code 200. It does not follow redirects (302), but there is also a cURL option for that (CURLOPT_FOLLOWLOCATION).
Look into cURL. There's a library for PHP.
There's also an executable version of cURL so you could even write the script in bash.
I actually wrote something in PHP that does this over a database of 5k+ URLs. I used the PEAR class HTTP_Request, which has a method called getResponseCode(). I just iterate over the URLs, passing them to getResponseCode and evaluate the response.
However, it doesn't work for FTP addresses, URLs that don't begin with http or https (unconfirmed, but I believe it's the case), and sites with invalid security certificates (a 0 is returned). Also, a 0 is returned for server-not-found (there's no status code for that).
And it's probably easier than cURL as you include a few files and use a single function to get an integer code back.
fopen() supports http URI.
If you need more flexibility (such as timeout), look into the cURL extension.
Seems like it might be a job for curl.
If you're not stuck on PHP, Perl's LWP might be an answer too.
You should also be aware of URLs returning 301 or 302 HTTP responses which redirect to another page. Generally this doesn't mean the link is invalid. For example, http://amazon.com returns 301 and redirects to http://www.amazon.com/.
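A small sketch of how a link checker could treat redirects separately from failures. The category names and the convention that 0 means "no HTTP response at all" are assumptions for illustration.

```php
<?php
// Classify an HTTP status code for a link checker; 0 stands for "no HTTP response".
function classify_status(int $code): string
{
    if ($code >= 200 && $code < 300) {
        return 'ok';
    }
    if ($code >= 300 && $code < 400) {
        return 'redirect';    // usually still a valid link, just moved
    }
    if ($code === 0) {
        return 'unreachable'; // DNS failure, timeout, refused connection
    }
    return 'broken';          // 4xx and 5xx responses, plus anything unexpected
}
```

With this, a 301 from http://amazon.com would be reported as 'redirect' rather than flagged as broken.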
Just returning a 200 response is not enough; many valid links will continue to return "200" after they change into porn / gambling portals when the former owner fails to renew.
Domain squatters typically ensure that every URL in their domains returns 200.
One potential problem you will undoubtedly run into is when the box this script is running on loses access to the Internet... you'll get 1000 false positives.
It would probably be better for your script to keep some type of history and only report a failure after 5 days of failure.
Also, the script should be self-checking in some way (like checking a known good web site [google?]) before continuing with the standard checks.
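Both suggestions can be sketched as two small helpers. The names should_report and run_if_online, and the shape of the history array (one boolean per daily check, newest last), are assumptions; $isUp stands in for whatever function actually checks a URL.

```php
<?php
// $history: one entry per daily check for a single URL, newest last, true = check passed.
function should_report(array $history, int $days = 5): bool
{
    if (count($history) < $days) {
        return false; // not enough data yet
    }
    // Report only if every one of the last $days checks failed.
    return !in_array(true, array_slice($history, -$days), true);
}

// Run the checks only when a known-good URL answers, so an outage on the
// checking machine is not recorded as a failure of every monitored URL.
function run_if_online(callable $isUp, string $knownGoodUrl, callable $runChecks): bool
{
    if (!$isUp($knownGoodUrl)) {
        return false; // our own connection is probably down; skip this run
    }
    $runChecks();
    return true;
}
```

A URL that failed four days in a row and then recovered is never reported, which filters out short outages on either side.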
You only need a bash script to do this. Please check my answer on a similar post here. It is a one-liner that reuses HTTP connections to dramatically improve speed, retries n times for temporary errors and follows redirects.