I'm using a PHP script (using cURL) to check whether:
The links in my database are correct (ie return HTTP status 200)
The links are in fact redirected and redirect to an appropriate/similar page (using the contents of the page)
The results of this are saved to a log file and emailed to me as an attachment.
This is all fine and working; however, it is slow as hell, and half the time it times out and aborts itself early. Of note, I have about 16,000 links to check.
I was wondering how best to make this run quicker, and what am I doing wrong?
Code below:
function echoappend ($file,$tobewritten) {
fwrite($file,$tobewritten);
echo $tobewritten;
}
error_reporting(E_ALL);
ini_set('display_errors', '1');
$filename=date('YmdHis') . "linkcheck.htm";
echo $filename;
$file = fopen($filename,"w+");
try {
$conn = new PDO('mysql:host=localhost;dbname=databasename',$un,$pw);
$conn->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
echo '<b>connected to db</b><br /><br />';
$sitearray = array("medical.posterous","ebm.posterous","behavenet","guidance.nice","www.rch","emedicine","www.chw","www.rxlist","www.cks.nhs.uk");
foreach ($sitearray as $key => $value) {
$site=$value;
echoappend ($file, "<h1>" . $site . "</h1>");
$q="SELECT * FROM link WHERE url LIKE :site";
$stmt = $conn->prepare($q);
$stmt->execute(array(':site' => 'http://' . $site . '%'));
$result = $stmt->fetchAll();
$totallinks = 0;
$workinglinks = 0;
foreach($result as $row)
{
$ch = curl_init();
$originalurl = $row['url'];
curl_setopt($ch, CURLOPT_URL, $originalurl);
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_NOBODY, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
$output = curl_exec($ch);
if ($output === FALSE) {
echo "cURL Error: " . curl_error($ch);
}
$urlinfo = curl_getinfo($ch);
if ($urlinfo['http_code'] == 200)
{
echoappend($file, $row['name'] . ": <b>working!</b><br />");
$workinglinks++;
}
else if ($urlinfo['http_code'] == 301 || 302)
{
$redirectch = curl_init();
curl_setopt($redirectch, CURLOPT_URL, $originalurl);
curl_setopt($redirectch, CURLOPT_HEADER, 1);
curl_setopt($redirectch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($redirectch, CURLOPT_NOBODY, false);
curl_setopt($redirectch, CURLOPT_FOLLOWLOCATION, true);
$redirectoutput = curl_exec($redirectch);
$doc = new DOMDocument();
@$doc->loadHTML($redirectoutput);
$nodes = $doc->getElementsByTagName('title');
$title = $nodes->item(0)->nodeValue;
echoappend ($file, $row['name'] . ": <b>redirect ... </b>" . $title . " ... ");
if (strpos(strtolower($title),strtolower($row['name']))===false) {
echoappend ($file, "FAIL<br />");
}
else {
$header = curl_getinfo($redirectch);
echoappend ($file, $header['url']);
echoappend ($file, "SUCCESS<br />");
}
curl_close($redirectch);
}
else
{
echoappend ($file, $row['name'] . ": <b>FAIL code</b>" . $urlinfo['http_code'] . "<br />");
}
curl_close($ch);
$totallinks++;
}
echoappend ($file, '<br />');
echoappend ($file, $site . ": " . $workinglinks . "/" . $totallinks . " links working. <br /><br />");
}
$conn = null;
echo '<br /><b>connection closed</b><br /><br />';
} catch(PDOException $e) {
echo 'ERROR: ' . $e->getMessage();
}
The short answer is to use the curl_multi_* functions to parallelize your requests.
The reason for the slowness is that web requests are comparatively slow. Sometimes VERY slow. Using the curl_multi_* functions lets you run multiple requests simultaneously.
One thing to be careful about is to limit the number of requests you run at once. In other words, don't run 16,000 requests at once. Maybe start at 16 and see how that goes.
The following example should help you get started:
<?php
//
// Fetch a bunch of URLs in parallel. Returns an array of results indexed
// by URL.
//
function fetch_urls($urls, $curl_options = array()) {
$curl_multi = curl_multi_init();
$handles = array();
$options = $curl_options + array(
CURLOPT_HEADER => true,
CURLOPT_RETURNTRANSFER => true,
CURLOPT_NOBODY => true,
CURLOPT_FOLLOWLOCATION => true);
foreach($urls as $url) {
$handles[$url] = curl_init($url);
curl_setopt_array($handles[$url], $options);
curl_multi_add_handle($curl_multi, $handles[$url]);
}
$active = null;
do {
$status = curl_multi_exec($curl_multi, $active);
} while ($status == CURLM_CALL_MULTI_PERFORM);
while ($active && ($status == CURLM_OK)) {
if (curl_multi_select($curl_multi) != -1) {
do {
$status = curl_multi_exec($curl_multi, $active);
} while ($status == CURLM_CALL_MULTI_PERFORM);
}
}
if ($status != CURLM_OK) {
trigger_error("Curl multi read error $status\n", E_USER_WARNING);
}
$results = array();
foreach($handles as $url => $handle) {
$results[$url] = curl_getinfo($handle);
curl_multi_remove_handle($curl_multi, $handle);
curl_close($handle);
}
curl_multi_close($curl_multi);
return $results;
}
//
// The urls to test
//
$urls = array("http://google.com", "http://yahoo.com", "http://google.com/probably-bogus", "http://www.google.com.au");
//
// The number of URLs to test simultaneously
//
$request_limit = 2;
//
// Test URLs in batches
//
$redirected_urls = array();
for ($i = 0 ; $i < count($urls) ; $i += $request_limit) {
$results = fetch_urls(array_slice($urls, $i, $request_limit));
foreach($results as $url => $result) {
if ($result['http_code'] == 200) {
$status = "Worked!";
} else {
$status = "FAILED with {$result['http_code']}";
}
if ($result["redirect_count"] > 0) {
array_push($redirected_urls, $url);
echo "{$url}: ${status}\n";
} else {
echo "{$url}: redirected to {$result['url']} and {$status}\n";
}
}
}
//
// Handle redirected URLs
//
echo "Processing redirected URLs...\n";
for ($i = 0 ; $i < count($redirected_urls) ; $i += $request_limit) {
$results = fetch_urls(array_slice($redirected_urls, $i, $request_limit), array(CURLOPT_FOLLOWLOCATION => false));
foreach($results as $url => $result) {
if ($result['http_code'] == 301) {
echo "{$url} permanently redirected to {$result['url']}\n";
} else if ($result['http_code'] == 302) {
echo "{$url} termporarily redirected to {$result['url']}\n";
} else {
echo "{$url}: FAILED with {$result['http_code']}\n";
}
}
}
The above code processes a list of URLs in batches. It works in two passes. In the first pass, each request is configured to follow redirects, and it simply reports whether each URL ultimately led to a successful request or a failure.
The second pass processes any redirected URLs detected in the first pass and reports whether the redirect was a permanent redirection (meaning you can update your database with the new URL), or temporary (meaning you should NOT update your database).
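If you do choose to update permanently redirected links automatically, a minimal sketch using the same PDO connection and `link` table as the question might look like the following (this assumes `url` is the column to rewrite; it is not part of the tested code above):

// Sketch: rewrite a permanently redirected link to its final location.
// $conn is the PDO connection from the question; $oldurl/$newurl come from the redirect check.
function update_permanent_redirect(PDO $conn, $oldurl, $newurl) {
    $stmt = $conn->prepare("UPDATE link SET url = :newurl WHERE url = :oldurl");
    $stmt->execute(array(':newurl' => $newurl, ':oldurl' => $oldurl));
    return $stmt->rowCount(); // number of rows changed
}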
NOTE:
In your original code, you have the following line, which will not work the way you expect it to:
else if ($urlinfo['http_code'] == 301 || 302)
The expression will ALWAYS evaluate to TRUE, because the bare literal 302 on the right of the || is always truthy. The correct expression is:
else if ($urlinfo['http_code'] == 301 || $urlinfo['http_code'] == 302)
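An equivalent check that reads more cleanly if you ever need to handle more redirect codes (307, 308, and so on) is in_array; this is only an alternative, not something your code requires:

else if (in_array($urlinfo['http_code'], array(301, 302), true))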
Also, put
set_time_limit(0);
at the top of your script to stop it aborting when it hits 30 seconds.
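Independently of the PHP time limit, it may also help to cap how long each individual cURL request is allowed to take, so that a single slow server cannot stall the whole run. A minimal sketch (the 5 and 15 second values are arbitrary; tune them to your links):

curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5); // give up on connecting after 5 seconds
curl_setopt($ch, CURLOPT_TIMEOUT, 15);       // abort the whole request after 15 seconds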
I am building a cron job that makes API calls in a loop over DB entries, and I have performance issues.
Particularly in this part:
if (!empty($sudCode) && !empty($sudBroj) && isset($sudCode) && isset($sudBroj)) {
// echo $sudCode . "<br>";
// echo $sudBroj . "<br>";
$epredmet = ePredmeti($sudCode, $sudBroj);
// print_r($epredmet);
// echo "<br>";
if (isset($epredmet["data"]["prvi"]["lastUpdateTime"])) {
$lastUpdateTime = $epredmet["data"]["prvi"]["lastUpdateTime"];
$dateTime = str_replace("T", " ", $lastUpdateTime);
echo $nas . " - " . $dateTime . "<br>";
}
}
on line:
if (isset($epredmet["data"]["prvi"]["lastUpdateTime"])) {
I have a few databases, and on one of them, when this line is reached the server goes to a 504 Gateway Time-out after 2 minutes.
The hosting company said it times out because the Apache web server waits for the PHP parser to process the data, whatever that means.
What is strange is that if I leave out that if check, the script finishes and I get results, but with Notice: Trying to access array offset on value of type null in
That is expected, because $epredmet after the API call can look like any of these:
- array(1) { ["data"]=> array(1) { ["prvi"]=> NULL } } // case not found
- array(1) { ["data"]=> array(1) { ["prvi"]=> array(1) { ["lastUpdateTime"]=> NULL } } } // case found but lastUpdateTime is not set, null
- array(1) { ["data"]=> array(1) { ["prvi"]=> array(1) { ["lastUpdateTime"]=> string(23) "2021-06-14T22:51:22.171" } } } // case found and lastUpdateTime is set
So what I need to do is filter out just the last case, where lastUpdateTime is set. Everything I have read suggests solving it with isset, but that breaks my script for some reason.
PHP V 7.4
Please advise.
I'm attaching the full script in case someone notices a problem somewhere else:
function eSudovi()
{
$endpoint = "xxx";
$qry = '{"query":"query{sudovi {id, sudNaziv}}"}';
$headers = array();
$headers[] = 'Content-Type: application/json';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $endpoint);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $qry);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
$result = curl_exec($ch);
curl_close($ch);
return json_decode($result, true);
}
$eSudovi = eSudovi()["data"]["sudovi"];
function findSudCode($val, $eSudovi)
{
foreach ($eSudovi as $key => $value) {
if ($value["sudNaziv"] == $val) {
return $value["id"];
}
}
}
function ePredmeti($sud, $pred)
{
$endpoint = "xxx";
$qry = '{"query":"query{ prvi:predmet(sud: ' . $sud . ', oznakaBroj: \"' . $pred . '\") {lastUpdateTime}}"}';
$headers = array();
$headers[] = 'Content-Type: application/json';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $endpoint);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $qry);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
$result = curl_exec($ch);
curl_close($ch);
return json_decode($result, true);
}
$results = mysqli_query($con, "
SELECT DISTINCT predf_nas_br, predf_odv, predf_SUD, predf_SUDBROJ
FROM PREDMETIFView
WHERE predf_SUD <> '' AND predf_SUDBROJ <> '' AND predf_SUDBROJ NOT LIKE '% %'
UNION ALL
SELECT DISTINCT predp_nas_br, predp_odv, predp_SUD, predp_SUDBROJ
FROM PREDMETIPView
WHERE predp_SUD <> '' AND predp_SUDBROJ <> '' AND predp_SUDBROJ NOT LIKE '% %'
;");
while ($row = $results->fetch_assoc()) {
foreach ($row as $key => $value) {
if ($key == "predf_nas_br") {
$nas = $value;
}
if ($key == "predf_SUD") {
$sud = trim($value);
if (!empty($sud) && isset($sud)) {
$sudCode = findSudCode($sud, $eSudovi);
}
};
if ($key == "predf_SUDBROJ") {
$sudBroj = trim($value);
};
if (!empty($sudCode) && !empty($sudBroj) && isset($sudCode) && isset($sudBroj)) {
// echo $sudCode . "<br>";
// echo $sudBroj . "<br>";
$epredmet = ePredmeti($sudCode, $sudBroj);
print_r($epredmet);
echo "<br>";
if (isset($epredmet["data"]["prvi"]["lastUpdateTime"])) {
$lastUpdateTime = $epredmet["data"]["prvi"]["lastUpdateTime"];
$dateTime = str_replace("T", " ", $lastUpdateTime);
echo $nas . " - " . $dateTime . "<br>";
}
}
}
};
// preg_match('/\s/', $sudBroj)
Edit:
I also tried this:
if (isset($epredmet["data"]["prvi"]["lastUpdateTime"]) && !empty($epredmet["data"]["prvi"]["lastUpdateTime"])) {
and this:
if (isset($epredmet["data"]["prvi"]) && !empty($epredmet["data"]["prvi"])) {
if (isset($epredmet["data"]["prvi"]["lastUpdateTime"]) && !empty($epredmet["data"]["prvi"]["lastUpdateTime"])) {
Same thing: it hangs. Without any of these checks, everything works, but with the errors.
Something like this would do to reuse the curl handle (note: I haven't had time to test it, but you'll get the idea).
class ePredmeti{
public $epredmet;
private $curl,$ini_opt;
function __construct(){
$endpoint ='xxx';
$headers = ['Content-Type: application/json'];
$timeout = 30;
$this->curl= curl_init();
$this->ini_opt=[
CURLOPT_URL => $endpoint,
CURLOPT_RETURNTRANSFER => true,
CURLOPT_POST => true,
CURLOPT_HTTPHEADER => $headers,
CURLOPT_CONNECTTIMEOUT => $timeout,
CURLOPT_TIMEOUT => $timeout
];
}
public function _exec($sud, $pred){
$start=microtime(true);
$this->epredmet = null;
$query_opt=[
CURLOPT_POSTFIELDS=>
'{"query":"query{ prvi:predmet(sud: ' . $sud . ', oznakaBroj: \"' . $pred . '\") {lastUpdateTime}}"}'
];
curl_reset($this->curl);
curl_setopt_array($this->curl, $this->ini_opt);
curl_setopt_array($this->curl, $query_opt);
$ret = curl_exec($this->curl);
if (!curl_errno($this->curl)){
$http_code = curl_getinfo($this->curl, CURLINFO_HTTP_CODE);
if($http_code !== 200){
echo 'HTTP error: '.$http_code.'<br>';
}
else{
$this->epredmet = json_decode($ret,true);
}
}
else{
echo curl_error($this->curl).'<br>';
}
echo 'Took: '.(microtime(true)-$start).'<br>';
}
}
before the while() put something like:
$mycurl = new ePredmeti();
and instead of $epredmet = ePredmeti($sudCode, $sudBroj); use
$mycurl->_exec($sudCode, $sudBroj);
Finally, instead of if (isset($epredmet["data"]["prvi"]["lastUpdateTime"])) { you can use
if( isset($mycurl->epredmet["data"]["prvi"]["lastUpdateTime"]) ) {
The last one works because the class returns null on any error and isset() checks if a variable exists and is not null.
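Putting those pieces together, the relevant part of the loop might look roughly like this (a sketch only, reusing the variable names from the question):

$mycurl = new ePredmeti(); // create once, before the while() loop

// ... inside the loop, where ePredmeti() used to be called:
$mycurl->_exec($sudCode, $sudBroj); // reuses the same curl handle each time
if (isset($mycurl->epredmet["data"]["prvi"]["lastUpdateTime"])) {
    $lastUpdateTime = $mycurl->epredmet["data"]["prvi"]["lastUpdateTime"];
    $dateTime = str_replace("T", " ", $lastUpdateTime);
    echo $nas . " - " . $dateTime . "<br>";
}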
This is the whole crawler code that I am trying to build. It is a single-domain crawler, but it has a big problem: when I checked the database, it was saving some of the links again and again, which creates an infinite loop. I want to solve this problem without using my database, because checking each link for presence in the database would make the crawler slow. How can I do that? And do you have any suggestions to make it faster?
<?php
include_once('ganon.php');
ini_set('display_errors', '1');
function gethost($link)
{
$link = trim($link, '/');
if (!preg_match('#^http(s)?://#', $link))
{
$link = 'http://' . $link;
}
$urlParts = parse_url($link);
$domain = preg_replace('/^www\./', '', $urlParts['host']);
return $domain;
}
function store($raw, $link)
{
$html = str_get_dom($raw);
$title = $html('title', 0)->getPlainText();
$con = #mysqli_connect('somehost', 'someuser', 'somepassword', 'somedatabase');
if (!$con)
{
echo "Error: " . mysqli_connect_error();
exit();
}
$query = "INSERT INTO `somedatabase`.`sometable` (`title`, `url`) VALUES ('$title', '$link');";
mysqli_query($con, $query);
mysqli_close($con);
echo $title."<br>";
}
function crawl_save_crawl($target)
{
$curl = curl_init($target);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
$result = curl_exec($curl);
if(curl_errno($curl))
{
echo 'Curl error: ' . curl_error($curl);
}
curl_close($curl);
$dom = str_get_dom($result);
foreach($dom('a') as $element)
{
$href = $element->href;
if (0 !== strpos($href, 'http'))
{
$path = '/' . ltrim($href, '/');
if (extension_loaded('http'))
{
$href = http_build_url("http://www.".gethost($target), array('path' => $path));
}
else
{
$parts = parse_url("http://www.".gethost($target));
$href = $parts['scheme'] . '://';
if (isset($parts['user']) && isset($parts['pass']))
{
$href .= $parts['user'] . ':' . $parts['pass'] . '#';
}
$href .= $parts['host'];
if (isset($parts['port']))
{
$href .= ':' . $parts['port'];
}
$href .= dirname($parts['path'], 1).$path;
}
}
if (gethost($target) == gethost($href))
{
crawl_save_crawl($href);
}
}
store($result, $target);
}
$url=$_GET['u'];
crawl_save_crawl($url);
?>
In your crawl_save_crawl() function, you could store the links you've already visited and therefore stop the code going back to them. Using a static variable isn't ideal, but in such a limited piece of code it serves the purpose it was intended for (to hold its value across calls).
This doesn't stop it searching things other pages have searched, but stops it looping in itself.
function crawl_save_crawl($target)
{
static $alreadyDone = null;
if ( $alreadyDone == null ) {
$alreadyDone = [$target];
}
On the first call, this adds the starting URL to the list, since it would otherwise never be recorded.
Then, before you call the same routine again, you can test whether a link has already been checked...
$visit = trim(str_replace(["http://","https://"], "", $href), '/');
if (in_array($visit, $alreadyDone) === false &&
gethost($target) == gethost($href))
{
$alreadyDone[] = $visit;
crawl_save_crawl($href);
}
This may still look as though it's visiting the same page, but because your logic sometimes builds an href with 'www.' at the start, that URL may differ from the one without it. So when crawling stackoverflow, there would be both stackoverflow.com and www.stackoverflow.com.
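If you want those two forms to count as the same page, you could normalise the host before recording it, in the same way gethost() already strips a leading 'www.'. A small sketch (untested) of the check above with that normalisation added:

$visit = trim(str_replace(["http://", "https://"], "", $href), '/');
$visit = preg_replace('/^www\./', '', $visit); // treat www.example.com and example.com as the same page
if (in_array($visit, $alreadyDone) === false && gethost($target) == gethost($href)) {
    $alreadyDone[] = $visit;
    crawl_save_crawl($href);
}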
I am building a simple app that reads JSON data from 15 different URLs. I have a special need to do this server-side. I am using file_get_contents($url).
Since I am using file_get_contents($url), I wrote a simple script; here it is:
$websites = array(
$url1,
$url2,
$url3,
...
$url15
);
foreach ($websites as $website) {
$data[] = file_get_contents($website);
}
It has proven to be very slow, because it waits for the first request to finish before doing the next one.
If you mean multi-curl, then something like this might help:
$nodes = array($url1, $url2, $url3);
$node_count = count($nodes);
$curl_arr = array();
$master = curl_multi_init();
for($i = 0; $i < $node_count; $i++)
{
$url =$nodes[$i];
$curl_arr[$i] = curl_init($url);
curl_setopt($curl_arr[$i], CURLOPT_RETURNTRANSFER, true);
curl_multi_add_handle($master, $curl_arr[$i]);
}
do {
curl_multi_exec($master,$running);
} while($running > 0);
for($i = 0; $i < $node_count; $i++)
{
$results[] = curl_multi_getcontent ( $curl_arr[$i] );
}
print_r($results);
I don't particularly like the approach of any of the existing answers.
Timo's code might sleep/select() during CURLM_CALL_MULTI_PERFORM, which is wrong; it might also fail to sleep when ($still_running > 0 && $exec != CURLM_CALL_MULTI_PERFORM), which may make the code spin at 100% CPU usage (of 1 core) for no reason.
Sudhir's code will not sleep when $still_running > 0, and it spam-calls the asynchronous function curl_multi_exec() until everything has been downloaded, which causes PHP to use 100% CPU (of 1 CPU core) until everything has been downloaded; in other words, it fails to sleep while downloading.
Here's an approach with neither of those issues:
$websites = array(
"http://google.com",
"http://example.org"
// $url2,
// $url3,
// ...
// $url15
);
$mh = curl_multi_init();
foreach ($websites as $website) {
$worker = curl_init($website);
curl_setopt_array($worker, [
CURLOPT_RETURNTRANSFER => 1
]);
curl_multi_add_handle($mh, $worker);
}
for (;;) {
$still_running = null;
do {
$err = curl_multi_exec($mh, $still_running);
} while ($err === CURLM_CALL_MULTI_PERFORM);
if ($err !== CURLM_OK) {
// handle curl multi error?
}
if ($still_running < 1) {
// all downloads completed
break;
}
// some haven't finished downloading, sleep until more data arrives:
curl_multi_select($mh, 1);
}
$results = [];
while (false !== ($info = curl_multi_info_read($mh))) {
if ($info["result"] !== CURLE_OK) {
// handle download error?
}
$results[curl_getinfo($info["handle"], CURLINFO_EFFECTIVE_URL)] = curl_multi_getcontent($info["handle"]);
curl_multi_remove_handle($mh, $info["handle"]);
curl_close($info["handle"]);
}
curl_multi_close($mh);
var_export($results);
Note that an issue shared by all 3 approaches here (my answer, Sudhir's answer, and Timo's answer) is that they open all connections simultaneously; if you have 1,000,000 websites to fetch, these scripts will try to open 1,000,000 connections at once. If you only need to download, say, 50 websites at a time, try something like this:
$websites = array(
"http://google.com",
"http://example.org"
// $url2,
// $url3,
// ...
// $url15
);
var_dump(fetch_urls($websites,50));
function fetch_urls(array $urls, int $max_connections, int $timeout_ms = 10000, bool $return_fault_reason = true): array
{
if ($max_connections < 1) {
throw new InvalidArgumentException("max_connections MUST be >=1");
}
foreach ($urls as $key => $foo) {
if (! is_string($foo)) {
throw new \InvalidArgumentException("all urls must be strings!");
}
if (empty($foo)) {
unset($urls[$key]); // ?
}
}
unset($foo);
// DISABLED for benchmarking purposes: $urls = array_unique($urls); // remove duplicates.
$ret = array();
$mh = curl_multi_init();
$workers = array();
$work = function () use (&$ret, &$workers, &$mh, $return_fault_reason) {
// > If an added handle fails very quickly, it may never be counted as a running_handle
while (1) {
do {
$err = curl_multi_exec($mh, $still_running);
} while ($err === CURLM_CALL_MULTI_PERFORM);
if ($still_running < count($workers)) {
// some workers finished, fetch their response and close them
break;
}
$cms = curl_multi_select($mh, 1);
// var_dump('sr: ' . $still_running . " c: " . count($workers)." cms: ".$cms);
}
while (false !== ($info = curl_multi_info_read($mh))) {
// echo "NOT FALSE!";
// var_dump($info);
{
if ($info['msg'] !== CURLMSG_DONE) {
continue;
}
if ($info['result'] !== CURLE_OK) {
if ($return_fault_reason) {
$ret[$workers[(int) $info['handle']]] = print_r(array(
false,
$info['result'],
"curl_exec error " . $info['result'] . ": " . curl_strerror($info['result'])
), true);
}
} elseif (CURLE_OK !== ($err = curl_errno($info['handle']))) {
if ($return_fault_reason) {
$ret[$workers[(int) $info['handle']]] = print_r(array(
false,
$err,
"curl error " . $err . ": " . curl_strerror($err)
), true);
}
} else {
$ret[$workers[(int) $info['handle']]] = curl_multi_getcontent($info['handle']);
}
curl_multi_remove_handle($mh, $info['handle']);
assert(isset($workers[(int) $info['handle']]));
unset($workers[(int) $info['handle']]);
curl_close($info['handle']);
}
}
// echo "NO MORE INFO!";
};
foreach ($urls as $url) {
while (count($workers) >= $max_connections) {
// echo "TOO MANY WORKERS!\n";
$work();
}
$neww = curl_init($url);
if (! $neww) {
trigger_error("curl_init() failed! probably means that max_connections is too high and you ran out of system resources", E_USER_WARNING);
if ($return_fault_reason) {
$ret[$url] = array(
false,
-1,
"curl_init() failed"
);
}
continue;
}
$workers[(int) $neww] = $url;
curl_setopt_array($neww, array(
CURLOPT_RETURNTRANSFER => 1,
CURLOPT_SSL_VERIFYHOST => 0,
CURLOPT_SSL_VERIFYPEER => 0,
CURLOPT_TIMEOUT_MS => $timeout_ms
));
curl_multi_add_handle($mh, $neww);
// curl_multi_exec($mh, $unused_here); LIKELY TO BE MUCH SLOWER IF DONE IN THIS LOOP: TOO MANY SYSCALLS
}
while (count($workers) > 0) {
// echo "WAITING FOR WORKERS TO BECOME 0!";
// var_dump(count($workers));
$work();
}
curl_multi_close($mh);
return $ret;
}
That will download the entire list without downloading more than 50 URLs simultaneously.
(Even that approach stores all the results in RAM, so it may still run out of memory. If you want to store the results in a database instead of in RAM, the curl_multi_getcontent part can be modified to write to a database rather than to a RAM-persistent variable.)
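For example, the line that currently stores the body in $ret could instead write straight to a database. A minimal sketch with PDO, where $pdo and the `pages` table are hypothetical and only here for illustration:

// Instead of: $ret[$workers[(int) $info['handle']]] = curl_multi_getcontent($info['handle']);
$stmt = $pdo->prepare("INSERT INTO pages (url, body) VALUES (:url, :body)");
$stmt->execute(array(
    ':url'  => $workers[(int) $info['handle']],
    ':body' => curl_multi_getcontent($info['handle']),
));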
I would like to provide a more complete example that does not hit the CPU at 100% and does not crash when there's a slight error or something unexpected.
It also shows you how to fetch the headers, the body, request info and manual redirect following.
Disclaimer: this code is intended to be extended and implemented into a library or used as a quick starting point, and as such the functions inside of it are kept to a minimum.
function mtime(){
return microtime(true);
}
function ptime($prev){
$t = microtime(true) - $prev;
$t = $t * 1000;
return str_pad($t, 20, 0, STR_PAD_RIGHT);
}
// This function exists to add compatibility for CURLM_CALL_MULTI_PERFORM for old curl versions, on modern curl it will only run once and be the equivalent of calling curl_multi_exec
function curl_multi_exec_full($mh, &$still_running) {
// In theory curl_multi_exec should never return CURLM_CALL_MULTI_PERFORM (-1) because it has been deprecated
// In practice it sometimes does
// So imagine that this just runs curl_multi_exec once and returns its value
do {
$state = curl_multi_exec($mh, $still_running);
// curl_multi_select($mh, $timeout) simply blocks for $timeout seconds while curl_multi_exec() returns CURLM_CALL_MULTI_PERFORM
// We add it to prevent CPU 100% usage in case this thing misbehaves (especially for old curl on windows)
} while ($still_running > 0 && $state === CURLM_CALL_MULTI_PERFORM && curl_multi_select($mh, 0.1));
return $state;
}
// This function replaces curl_multi_select and makes the name make more sense, since all we're doing is waiting for curl, it also forces a minimum sleep time between requests to avoid excessive CPU usage.
function curl_multi_wait($mh, $minTime = 0.001, $maxTime = 1){
$umin = $minTime*1000000;
$start_time = microtime(true);
// it sleeps until there is some activity on any of the descriptors (curl files)
// it returns the number of descriptors (curl files that can have activity)
$num_descriptors = curl_multi_select($mh, $maxTime);
// if the system returns -1, it means that the wait time is unknown, and we have to decide the minimum time to wait
// but our `$timespan` check below catches this edge case, so this `if` isn't really necessary
if($num_descriptors === -1){
usleep($umin);
}
$timespan = (microtime(true) - $start_time) * 1000000; // elapsed time in microseconds
// This thing runs very fast, up to 1000 times for 2 urls, which wastes a lot of CPU
// This will reduce the runs so that each interval is separated by at least minTime
if($timespan < $umin){
usleep($umin - $timespan);
//print "sleep for ".($umin - $timespan).PHP_EOL;
}
}
$handles = [
[
CURLOPT_URL=>"http://example.com/",
CURLOPT_HEADER=>false,
CURLOPT_RETURNTRANSFER=>true,
CURLOPT_FOLLOWLOCATION=>false,
],
[
CURLOPT_URL=>"http://www.php.net",
CURLOPT_HEADER=>false,
CURLOPT_RETURNTRANSFER=>true,
CURLOPT_FOLLOWLOCATION=>false,
// this function is called by curl for each header received
// This complies with RFC822 and RFC2616, please do not suggest edits to make use of the mb_ string functions, it is incorrect!
// https://stackoverflow.com/a/41135574
CURLOPT_HEADERFUNCTION=>function($ch, $header)
{
print "header from http://www.php.net: ".$header;
//$header = explode(':', $header, 2);
//if (count($header) < 2){ // ignore invalid headers
// return $len;
//}
//$headers[strtolower(trim($header[0]))][] = trim($header[1]);
return strlen($header);
}
]
];
//create the multiple cURL handle
$mh = curl_multi_init();
$chandles = [];
foreach($handles as $opts) {
// create cURL resources
$ch = curl_init();
// set URL and other appropriate options
curl_setopt_array($ch, $opts);
// add the handle
curl_multi_add_handle($mh, $ch);
$chandles[] = $ch;
}
//execute the multi handle
$prevRunning = null;
$count = 0;
do {
$time = mtime();
// $running contains the number of currently running requests
$status = curl_multi_exec_full($mh, $running);
$count++;
print ptime($time).": curl_multi_exec status=$status running $running".PHP_EOL;
// One less is running, meaning one has finished
if($running < $prevRunning){
print ptime($time).": curl_multi_info_read".PHP_EOL;
// msg: The CURLMSG_DONE constant. Other return values are currently not available.
// result: One of the CURLE_* constants. If everything is OK, the CURLE_OK will be the result.
// handle: Resource of type curl indicates the handle which it concerns.
while ($read = curl_multi_info_read($mh, $msgs_in_queue)) {
$info = curl_getinfo($read['handle']);
if($read['result'] !== CURLE_OK){
// handle the error somehow
print "Error: ".$info['url'].PHP_EOL;
}
if($read['result'] === CURLE_OK){
/*
// This will automatically follow the redirect and still give you control over the previous page
// TODO: max redirect checks and redirect timeouts
if(isset($info['redirect_url']) && trim($info['redirect_url'])!==''){
print "running redirect: ".$info['redirect_url'].PHP_EOL;
$ch3 = curl_init();
curl_setopt($ch3, CURLOPT_URL, $info['redirect_url']);
curl_setopt($ch3, CURLOPT_HEADER, 0);
curl_setopt($ch3, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch3, CURLOPT_FOLLOWLOCATION, 0);
curl_multi_add_handle($mh,$ch3);
}
*/
print_r($info);
$body = curl_multi_getcontent($read['handle']);
print $body;
}
}
}
// Still running? keep waiting...
if ($running > 0) {
curl_multi_wait($mh);
}
$prevRunning = $running;
} while ($running > 0 && $status == CURLM_OK);
//close the handles
foreach($chandles as $ch){
curl_multi_remove_handle($mh, $ch);
}
curl_multi_close($mh);
print $count.PHP_EOL;
I am trying to study class.openid.php because it is simpler and smaller than lightopenid; for my purposes, 200 lines do matter. But class.openid.php does not work with the Google OpenID https://www.google.com/accounts/o8/id; it prints this error:
ERROR CODE: OPENID_NOSERVERSFOUND
ERROR DESCRIPTION: Cannot find OpenID Server TAG on Identity page.
Is it possible to make class.openid.php (any version) work with Google OpenID, and how would I do that?
class.openid.php can be taken here, but it did not work for me out of the box, so I had to find all <? tags and replace them with <?php. In case someone would like to see the code I've got:
html interface page:
<?php
require('class.openid.v3.php');
if ($_POST['openid_action'] == "login"){ // Get identity from user and redirect browser to OpenID Server
$openid = new SimpleOpenID;
$openid->SetIdentity($_POST['openid_url']);
$openid->SetTrustRoot('http://' . $_SERVER["HTTP_HOST"]);
$openid->SetRequiredFields(array('email','fullname'));
$openid->SetOptionalFields(array('dob','gender','postcode','country','language','timezone'));
if ($openid->GetOpenIDServer()){
$openid->SetApprovedURL('http://' . $_SERVER["HTTP_HOST"] . $_SERVER["PATH_INFO"]); // Send Response from OpenID server to this script
$openid->Redirect(); // This will redirect user to OpenID Server
}else{
$error = $openid->GetError();
echo "ERROR CODE: " . $error['code'] . "<br>";
echo "ERROR DESCRIPTION: " . $error['description'] . "<br>";
}
exit;
}
else if($_GET['openid_mode'] == 'id_res'){ // Perform HTTP Request to OpenID server to validate key
$openid = new SimpleOpenID;
$openid->SetIdentity($_GET['openid_identity']);
$openid_validation_result = $openid->ValidateWithServer();
if ($openid_validation_result == true){ // OK HERE KEY IS VALID
echo "VALID";
}else if($openid->IsError() == true){ // ON THE WAY, WE GOT SOME ERROR
$error = $openid->GetError();
echo "ERROR CODE: " . $error['code'] . "<br>";
echo "ERROR DESCRIPTION: " . $error['description'] . "<br>";
}else{ // Signature Verification Failed
echo "INVALID AUTHORIZATION";
}
}else if ($_GET['openid_mode'] == 'cancel'){ // User Canceled your Request
echo "USER CANCELED REQUEST";
}
?>
<html>
<head>
<title>OpenID Example</title>
</head>
<body>
<div>
<fieldset id="openid">
<legend>OpenID Login</legend>
<form action="<?php echo 'http://' . $_SERVER["HTTP_HOST"] . $_SERVER["PATH_INFO"]; ?>" method="post" onsubmit="this.login.disabled=true;">
<input type="hidden" name="openid_action" value="login">
<div><input type="text" name="openid_url" class="openid_login"><input type="submit" name="login" value="login >>"></div>
<div><a href="http://www.myopenid.com/" class="link" >Get an OpenID</a></div>
</form>
</fieldset>
</div>
<div style="margin-top: 2em; font-family: arial; font-size: 0.8em; border-top:1px solid gray; padding: 4px;">Sponsored by: FiveStores - get your free online store; includes extensive API for developers; <i style="color: gray;">integrated with OpenID</i></div>
</body>
</html>
and php class
<?php
/*
FREE TO USE Under License: GPLv3
Simple OpenID PHP Class
Some modifications by Eddie Roosenmaallen, eddie#roosenmaallen.com
*/
class SimpleOpenID{
var $openid_url_identity;
var $URLs = array();
var $error = array();
var $fields = array(
'required' => array(),
'optional' => array(),
);
function SimpleOpenID(){
if (!function_exists('curl_exec')) {
die('Error: Class SimpleOpenID requires curl extension to work');
}
}
function SetOpenIDServer($a){
$this->URLs['openid_server'] = $a;
}
function SetTrustRoot($a){
$this->URLs['trust_root'] = $a;
}
function SetCancelURL($a){
$this->URLs['cancel'] = $a;
}
function SetApprovedURL($a){
$this->URLs['approved'] = $a;
}
function SetRequiredFields($a){
if (is_array($a)){
$this->fields['required'] = $a;
}else{
$this->fields['required'][] = $a;
}
}
function SetOptionalFields($a){
if (is_array($a)){
$this->fields['optional'] = $a;
}else{
$this->fields['optional'][] = $a;
}
}
function SetIdentity($a){ // Set Identity URL
if ((stripos($a, 'http://') === false)
&& (stripos($a, 'https://') === false)){
$a = 'http://'.$a;
}
$this->openid_url_identity = $a;
}
function GetIdentity(){ // Get Identity
return $this->openid_url_identity;
}
function GetError(){
$e = $this->error;
return array('code'=>$e[0],'description'=>$e[1]);
}
function ErrorStore($code, $desc = null){
$errs['OPENID_NOSERVERSFOUND'] = 'Cannot find OpenID Server TAG on Identity page.';
if ($desc == null){
$desc = $errs[$code];
}
$this->error = array($code,$desc);
}
function IsError(){
if (count($this->error) > 0){
return true;
}else{
return false;
}
}
function splitResponse($response) {
$r = array();
$response = explode("\n", $response);
foreach($response as $line) {
$line = trim($line);
if ($line != "") {
list($key, $value) = explode(":", $line, 2);
$r[trim($key)] = trim($value);
}
}
return $r;
}
function OpenID_Standarize($openid_identity = null){
if ($openid_identity === null)
$openid_identity = $this->openid_url_identity;
$u = parse_url(strtolower(trim($openid_identity)));
if (!isset($u['path']) || ($u['path'] == '/')) {
$u['path'] = '';
}
if(substr($u['path'],-1,1) == '/'){
$u['path'] = substr($u['path'], 0, strlen($u['path'])-1);
}
if (isset($u['query'])){ // If there is a query string, then use identity as is
return $u['host'] . $u['path'] . '?' . $u['query'];
}else{
return $u['host'] . $u['path'];
}
}
function array2url($arr){ // converts associated array to URL Query String
if (!is_array($arr)){
return false;
}
$query = '';
foreach($arr as $key => $value){
$query .= $key . "=" . $value . "&";
}
return $query;
}
function CURL_Request($url, $method="GET", $params = "") { // Remember, SSL MUST BE SUPPORTED
if (is_array($params)) $params = $this->array2url($params);
$curl = curl_init($url . ($method == "GET" && $params != "" ? "?" . $params : ""));
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_HEADER, false);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($curl, CURLOPT_HTTPGET, ($method == "GET"));
curl_setopt($curl, CURLOPT_POST, ($method == "POST"));
if ($method == "POST") curl_setopt($curl, CURLOPT_POSTFIELDS, $params);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
$response = curl_exec($curl);
if (curl_errno($curl) == 0){
$response;
}else{
$this->ErrorStore('OPENID_CURL', curl_error($curl));
}
return $response;
}
function HTML2OpenIDServer($content) {
$get = array();
// Get details of their OpenID server and (optional) delegate
preg_match_all('/<link[^>]*rel=[\'"]openid.server[\'"][^>]*href=[\'"]([^\'"]+)[\'"][^>]*\/?>/i', $content, $matches1);
preg_match_all('/<link[^>]*href=[\'"]([^\'"]+)[\'"][^>]*rel=[\'"]openid.server[\'"][^>]*\/?>/i', $content, $matches2);
$servers = array_merge($matches1[1], $matches2[1]);
preg_match_all('/<link[^>]*rel=[\'"]openid.delegate[\'"][^>]*href=[\'"]([^\'"]+)[\'"][^>]*\/?>/i', $content, $matches1);
preg_match_all('/<link[^>]*href=[\'"]([^\'"]+)[\'"][^>]*rel=[\'"]openid.delegate[\'"][^>]*\/?>/i', $content, $matches2);
$delegates = array_merge($matches1[1], $matches2[1]);
$ret = array($servers, $delegates);
return $ret;
}
function GetOpenIDServer(){
$response = $this->CURL_Request($this->openid_url_identity);
list($servers, $delegates) = $this->HTML2OpenIDServer($response);
if (count($servers) == 0){
$this->ErrorStore('OPENID_NOSERVERSFOUND');
return false;
}
if (isset($delegates[0])
&& ($delegates[0] != "")){
$this->SetIdentity($delegates[0]);
}
$this->SetOpenIDServer($servers[0]);
return $servers[0];
}
function GetRedirectURL(){
$params = array();
$params['openid.return_to'] = urlencode($this->URLs['approved']);
$params['openid.mode'] = 'checkid_setup';
$params['openid.identity'] = urlencode($this->openid_url_identity);
$params['openid.trust_root'] = urlencode($this->URLs['trust_root']);
if (isset($this->fields['required'])
&& (count($this->fields['required']) > 0)) {
$params['openid.sreg.required'] = implode(',',$this->fields['required']);
}
if (isset($this->fields['optional'])
&& (count($this->fields['optional']) > 0)) {
$params['openid.sreg.optional'] = implode(',',$this->fields['optional']);
}
return $this->URLs['openid_server'] . "?". $this->array2url($params);
}
function Redirect(){
$redirect_to = $this->GetRedirectURL();
if (headers_sent()){ // Use JavaScript to redirect if content has been previously sent (not recommended, but safe)
echo '<script language="JavaScript" type="text/javascript">window.location=\'';
echo $redirect_to;
echo '\';</script>';
}else{ // Default Header Redirect
header('Location: ' . $redirect_to);
}
}
function ValidateWithServer(){
$params = array(
'openid.assoc_handle' => urlencode($_GET['openid_assoc_handle']),
'openid.signed' => urlencode($_GET['openid_signed']),
'openid.sig' => urlencode($_GET['openid_sig'])
);
// Send only required parameters to confirm validity
$arr_signed = explode(",",str_replace('sreg.','sreg_',$_GET['openid_signed']));
for ($i=0; $i<count($arr_signed); $i++){
$s = str_replace('sreg_','sreg.', $arr_signed[$i]);
$c = $_GET['openid_' . $arr_signed[$i]];
// if ($c != ""){
$params['openid.' . $s] = urlencode($c);
// }
}
$params['openid.mode'] = "check_authentication";
$openid_server = $this->GetOpenIDServer();
if ($openid_server == false){
return false;
}
$response = $this->CURL_Request($openid_server,'POST',$params);
$data = $this->splitResponse($response);
if ($data['is_valid'] == "true") {
return true;
}else{
return false;
}
}
}
?>
The problem is that Google doesn't just supply an OpenID endpoint.
OpenId endpoints include an identifier for the user.
What we have here is called a Discovery URL.
This is a static URL that you can direct any user to; the service itself will recognise the user and return a per-user unique identifying URL.
This, however, is NOT implemented correctly by most OpenID client libraries, including the majority linked on the official OpenID website.
Even the Zend Framework libraries are incapable of handling that.
However, I found a class that I analysed from various perspectives and that I am very satisfied with. At the company I work for, we have already integrated it successfully into several production environments and have not experienced any problems.
You may also be interested in another post of mine dealing with the issue of making Facebook an OpenID provider. The class I am using, which also supports Google, can also be found there:
Best way to implement Single-Sign-On with all major providers?
The class in your question does not support OpenID 2.0 at all. Therefore, it will not work with Google without adding a lot of code.
Are you looking for something like this:
http://wiki.openid.net/w/page/12995176/Libraries
There is a PHP section in that list.
I'm trying to make a script that will load a desired URL (as entered by user) and check if that page links back to my domain before their domain is published on my site. I'm not very experienced with regular expressions and this is what I have so far:
$loaded = file_get_contents('http://localhost/small_script/page.php');
// $loaded will be equal to the users site they have submitted
$current_site = 'site2.com';
// $current_site is the domain of my site, this the the URL that must be found in target site
$matches = Array();
$find = preg_match_all('/<a(.*?)href=[\'"](.*?)[\'"](.*?)\b[^>]*>(.*?)<\/a>/i', $loaded, $matches);
$c = count($matches[0]);
$z = 0;
while($z<$c){
$full_link = $matches[0][$z];
$href = $matches[2][$z];
$z++;
$check = strpos($href,$current_site);
if($check === false) {
}else{
// The link cannot have the "no follow" tag, this is to check if it does and if so, return a specific error
$pos = strpos($full_link,'no follow');
if($pos === false) {
echo $href;
}
else {
//echo "rel=no follow FOUND";
}
}
}
As you can see, it's pretty messy and I'm not entirely sure where it's headed. I was hoping someone could give me a small, fast and concise script that would do exactly what I've attempted.
Load specified URL as entered by user
Check if specified URL links back to my site (if not, return error code #1)
If link is there, check for 'no follow', if found return error code #2
If everything is OK, set a variable to true, so I can continue with other functions (like displaying their link on my page)
This is the code :)
Helped by http://www.merchantos.com/makebeta/php/scraping-links-with-php/
<?php
$my_url = 'http://online.bulsam.net';
$target_url = 'http://www.bulsam.net';
$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
// make the cURL request to $target_url
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL,$target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html= curl_exec($ch);
if (!$html) {
echo "<br />cURL error number:" .curl_errno($ch);
echo "<br />cURL error:" . curl_error($ch);
exit;
}
// parse the html into a DOMDocument
$dom = new DOMDocument();
@$dom->loadHTML($html);
// grab all the links (<a> elements) on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
// find result
$result = is_my_link_there($hrefs, $my_url);
if ($result == 1) {
echo 'There is no link!!!';
} elseif ($result == 2) {
echo 'There is, but it is NO FOLLOW !!!';
} else {
// blah blah blah
}
// used functions
function is_my_link_there($hrefs, $my_url) {
for ($i = 0; $i < $hrefs->length; $i++) {
$href = $hrefs->item($i);
$url = $href->getAttribute('href');
if ($my_url == $url) {
$rel = $href->getAttribute('rel');
if ($rel == 'nofollow') {
return 2;
}
return 3;
}
}
return 1;
}