recursion problems - php

I'm grabbing links from a website, but the higher I set the recursion depth for the function, the stranger the results become.
For example, when I call the function like this:
crawl_page("http://www.mangastream.com/", 10);
I get results like this for about half the page:
http://mangastream.com/read/naruto/51619850/1/read/naruto/51619850/2/read/naruto/51619850/2/read/naruto/51619850/2/read/naruto/51619850/2/read/naruto/51619850/2/read/naruto/51619850/2/read/naruto/51619850/2
while I'm expecting results like this instead:
http://mangastream.com/manga/read/naruto/51619850/1
Here's the function I've been using to get the results:
function crawl_page($url, $depth)
{
static $seen = array();
if (isset($seen[$url]) || $depth === 0) {
return;
}
$seen[$url] = true;
$dom = new DOMDocument('1.0');
@$dom->loadHTMLFile($url);
$anchors = $dom->getElementsByTagName('a');
foreach ($anchors as $element) {
$href = $element->getAttribute('href');
if (0 !== strpos($href, 'http')) {
$href = rtrim($url, '/') . '/' . ltrim($href, '/');
}
if(shouldScrape($href)==true)
crawl_page($href, $depth - 1);
}
echo $url,"\r";
//,pageStatus($url)
}
Any help with this would be greatly appreciated.

The construction of your new URL is not correct. Replace:
$href = rtrim($url, '/') . '/' . ltrim($href, '/');
with:
if (substr($href, 0, 1)=='/') {
// href relative to root
$info = parse_url($url);
$href = $info['scheme'].'://'.$info['host'].$href;
} else {
// href relative to current path
$href = rtrim(dirname($url), '/') . '/' . $href;
}

I think your problem lies in this line:
$href = rtrim($url, '/') . '/' . ltrim($href, '/');
This statement prepends the current page URL to every relative URL on the page, which is obviously not what you want. What you need is to prepend only the protocol and host part of the URL.
Something like this should fix your problem (untested):
$url_parts = parse_url($url);
$href = $url_parts['scheme'] . '://' . $url_parts['host'] . $href;
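Combining the two suggestions above, a minimal sketch could look like the following. The name resolve_href is made up for illustration. It assumes $pageUrl is a full URL with a scheme and host, and it deliberately ignores ports, credentials and ../ segments.

```php
<?php
// Hypothetical helper (not from the original posts): turn an href found
// on $pageUrl into an absolute URL.
function resolve_href($href, $pageUrl)
{
    // Already absolute (has a scheme such as http or https): leave untouched.
    if (parse_url($href, PHP_URL_SCHEME)) {
        return $href;
    }
    $parts = parse_url($pageUrl);
    // Protocol-relative link ("//host/path"): reuse the page's scheme.
    if (substr($href, 0, 2) === '//') {
        return $parts['scheme'] . ':' . $href;
    }
    $root = $parts['scheme'] . '://' . $parts['host'];
    // Root-relative link ("/path"): prepend scheme and host only.
    if (substr($href, 0, 1) === '/') {
        return $root . $href;
    }
    // Path-relative link: resolve against the directory of the current page.
    $path = isset($parts['path']) ? $parts['path'] : '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $root . $dir . $href;
}
```

Inside crawl_page() this would replace the rtrim/ltrim line, for example $href = resolve_href($href, $url);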

Related

How to avoid url with mailto:

I'm working in php and I have created a function that is getting links from a submitted url.
The code is working fine, but it even picks up links that are not real pages, like mailto:, #, javascript:void(0).
How can I avoid picking up a tags whose href looks like href="mailto:", href="tel:" or href="javascript:"?
Thank you in advance.
function check_all_links($url) {
$doc = new DOMDocument();
@$doc->loadHTML(file_get_contents($url));
$linklist = $doc->getElementsByTagName("a");
$title = $doc->getElementsByTagName("title");
$href = array();
$page_url = $full_url = $new_url = "";
$full_url = goodUrl($url);
$scheme = parse_url($url, PHP_URL_SCHEME);
$slash = '/';
$links = array();
$linkNo = array();
if ($scheme == "http") {
foreach ($linklist as $link) {
$href = strtolower($link->getAttribute('href'));
$page_url = parse_url($href, PHP_URL_PATH);
$new_url = $scheme."://".$full_url.'/'.ltrim($page_url, '/');
//check if href has mailto: or # or javascipt() or tel:
if (strpos($page_url, "tel:") === True) {
continue;
}
if(!in_array($new_url, $linkNo)) {
echo $new_url."<br>" ;
array_push($linkNo, $new_url);
$links[] = array('Links' => $new_url );
}
}
}else if ($scheme == "https") {
foreach ($linklist as $link) {
$href = strtolower($link->getAttribute('href'));
$page_url = parse_url($href, PHP_URL_PATH);
$new_url = $scheme."://".$full_url.'/'.ltrim($page_url, '/');
if (strpos($page_url, "tel:") === True) {
continue;
}
if(!in_array($new_url, $linkNo)) {
echo $new_url."<br>" ;
array_push($linkNo, $new_url);
$links[] = array('Links' => $new_url );
}
}
}
You can use the scheme field from the parse_url function result.
Instead of:
if (strpos($page_url, "tel:") === True) {
continue;
}
you can use:
$parts = parse_url($href); // full array of components, not just PHP_URL_PATH
if (isset($parts["scheme"]) && in_array($parts["scheme"], ["mailto", "tel", "javascript"])) {
continue;
}
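As a variation on the check above, here is a small sketch that also skips empty and fragment-only hrefs. Note that it tests the prefix with a regular expression instead of the parse_url scheme field. That is my own substitution, so the result does not depend on how parse_url splits unusual values. The function name is_skippable_href is hypothetical.

```php
<?php
// Hypothetical helper: returns true when an href should be ignored
// by the link checker.
function is_skippable_href($href)
{
    $href = trim($href);
    // Empty hrefs and pure fragments ("#", "#top") do not lead to another page.
    if ($href === '' || $href[0] === '#') {
        return true;
    }
    // Non-navigable schemes, matched case-insensitively at the start of the href.
    return preg_match('/^(mailto|tel|javascript):/i', $href) === 1;
}
```

In the foreach loop it could be used as: if (is_skippable_href($link->getAttribute('href'))) { continue; }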

PHP equivalent of Javascripts element.href [duplicate]

How to, using php, transform relative path to absolute URL?
function rel2abs($rel, $base)
{
/* return if already absolute URL */
if (parse_url($rel, PHP_URL_SCHEME) != '') return $rel;
/* queries and anchors */
if ($rel[0]=='#' || $rel[0]=='?') return $base.$rel;
/* parse base URL and convert to local variables:
$scheme, $host, $path */
extract(parse_url($base));
/* remove non-directory element from path */
$path = preg_replace('#/[^/]*$#', '', $path);
/* destroy path if relative url points to root */
if ($rel[0] == '/') $path = '';
/* dirty absolute URL */
$abs = "$host$path/$rel";
/* replace '//' or '/./' or '/foo/../' with '/' */
$re = array('#(/\.?/)#', '#/(?!\.\.)[^/]+/\.\./#');
for($n=1; $n>0; $abs=preg_replace($re, '/', $abs, -1, $n)) {}
/* absolute URL is ready! */
return $scheme.'://'.$abs;
}
I love the code that jordanstephens provided from the link! I voted it up. l0oky inspired me to make sure that the function is port, username, and password URL compatible. I needed it for my project.
function rel2abs( $rel, $base )
{
/* return if already absolute URL */
if( parse_url($rel, PHP_URL_SCHEME) != '' )
return( $rel );
/* queries and anchors */
if( $rel[0]=='#' || $rel[0]=='?' )
return( $base.$rel );
/* parse base URL and convert to local variables:
$scheme, $host, $path */
extract( parse_url($base) );
/* remove non-directory element from path */
$path = preg_replace( '#/[^/]*$#', '', $path );
/* destroy path if relative url points to root */
if( $rel[0] == '/' )
$path = '';
/* dirty absolute URL */
$abs = '';
/* do we have a user in our URL? */
if( isset($user) )
{
$abs.= $user;
/* password too? */
if( isset($pass) )
$abs.= ':'.$pass;
$abs.= '@';
}
$abs.= $host;
/* did somebody sneak in a port? */
if( isset($port) )
$abs.= ':'.$port;
$abs.=$path.'/'.$rel;
/* replace '//' or '/./' or '/foo/../' with '/' */
$re = array('#(/\.?/)#', '#/(?!\.\.)[^/]+/\.\./#');
for( $n=1; $n>0; $abs=preg_replace( $re, '/', $abs, -1, $n ) ) {}
/* absolute URL is ready! */
return( $scheme.'://'.$abs );
}
Added support to keep the current query. Helps a lot for ?page=1 and so on...
function rel2abs($rel, $base)
{
/* return if already absolute URL */
if (parse_url($rel, PHP_URL_SCHEME) != '')
return ($rel);
/* queries and anchors */
if ($rel[0] == '#' || $rel[0] == '?')
return ($base . $rel);
/* parse base URL and convert to local variables: $scheme, $host, $path, $query, $port, $user, $pass */
extract(parse_url($base));
/* remove non-directory element from path */
$path = preg_replace('#/[^/]*$#', '', $path);
/* destroy path if relative url points to root */
if ($rel[0] == '/')
$path = '';
/* dirty absolute URL */
$abs = '';
/* do we have a user in our URL? */
if (isset($user)) {
$abs .= $user;
/* password too? */
if (isset($pass))
$abs .= ':' . $pass;
$abs .= '@';
}
$abs .= $host;
/* did somebody sneak in a port? */
if (isset($port))
$abs .= ':' . $port;
$abs .= $path . '/' . $rel . (isset($query) ? '?' . $query : '');
/* replace '//' or '/./' or '/foo/../' with '/' */
$re = ['#(/\.?/)#', '#/(?!\.\.)[^/]+/\.\./#'];
for ($n = 1; $n > 0; $abs = preg_replace($re, '/', $abs, -1, $n)) {
}
/* absolute URL is ready! */
return ($scheme . '://' . $abs);
}
I updated the function to handle relative URLs starting with '//' and to improve execution speed.
function getAbsoluteUrl($relativeUrl, $baseUrl){
// if already absolute URL
if (parse_url($relativeUrl, PHP_URL_SCHEME) !== null){
return $relativeUrl;
}
// queries and anchors
if ($relativeUrl[0] === '#' || $relativeUrl[0] === '?'){
return $baseUrl.$relativeUrl;
}
// parse base URL and convert to: $scheme, $host, $path, $query, $port, $user, $pass
extract(parse_url($baseUrl));
// if base URL contains a path remove non-directory elements from $path
if (isset($path) === true){
$path = preg_replace('#/[^/]*$#', '', $path);
}
else {
$path = '';
}
// if relative URL starts with //
if (substr($relativeUrl, 0, 2) === '//'){
return $scheme.':'.$relativeUrl;
}
// if relative URL starts with /
if ($relativeUrl[0] === '/'){
$path = null;
}
$abs = null;
// if base URL contains a user
if (isset($user) === true){
$abs .= $user;
// if base URL contains a password
if (isset($pass) === true){
$abs .= ':'.$pass;
}
$abs .= '@';
}
$abs .= $host;
// if base URL contains a port
if (isset($port) === true){
$abs .= ':'.$port;
}
$abs .= $path.'/'.$relativeUrl.(isset($query) === true ? '?'.$query : null);
// replace // or /./ or /foo/../ with /
$re = ['#(/\.?/)#', '#/(?!\.\.)[^/]+/\.\./#'];
for ($n = 1; $n > 0; $abs = preg_replace($re, '/', $abs, -1, $n)) {
}
// return absolute URL
return $scheme.'://'.$abs;
}
A web browser uses the page URL or the base tag to resolve relative URLs.
This script can resolve a URL relative to a base URL.
/** Build a URL
*
* @param array $parts An array that follows the parse_url scheme
* @return string
*/
function build_url($parts)
{
if (empty($parts['user'])) {
$url = $parts['scheme'] . '://' . $parts['host'];
} elseif(empty($parts['pass'])) {
$url = $parts['scheme'] . '://' . $parts['user'] . '@' . $parts['host'];
} else {
$url = $parts['scheme'] . '://' . $parts['user'] . ':' . $parts['pass'] . '@' . $parts['host'];
}
if (!empty($parts['port'])) {
$url .= ':' . $parts['port'];
}
if (!empty($parts['path'])) {
$url .= $parts['path'];
}
if (!empty($parts['query'])) {
$url .= '?' . $parts['query'];
}
if (!empty($parts['fragment'])) {
return $url . '#' . $parts['fragment'];
}
return $url;
}
/** Convert a relative path in to an absolute path
*
* @param string $path
* @return string
*/
function abs_path($path)
{
$path_array = explode('/', $path);
// Solve current and parent folder navigation
$translated_path_array = array();
$i = 0;
foreach ($path_array as $name) {
if ($name === '..') {
unset($translated_path_array[--$i]);
} elseif (!empty($name) && $name !== '.') {
$translated_path_array[$i++] = $name;
}
}
return '/' . implode('/', $translated_path_array);
}
/** Convert a relative URL in to an absolute URL
*
* @param string $url URL or URI
* @param string $base Absolute URL
* @return string
*/
function abs_url($url, $base)
{
$url_parts = parse_url($url);
$base_parts = parse_url($base);
// Handle the path if it is specified
if (!empty($url_parts['path'])) {
// Is the path relative
if (substr($url_parts['path'], 0, 1) !== '/') {
if (substr($base_parts['path'], -1) === '/') {
$url_parts['path'] = $base_parts['path'] . $url_parts['path'];
} else {
$url_parts['path'] = dirname($base_parts['path']) . '/' . $url_parts['path'];
}
}
// Make path absolute
$url_parts['path'] = abs_path($url_parts['path']);
}
// Use the base URL to populate the unfilled components until a component is filled
foreach (['scheme', 'host', 'path', 'query', 'fragment'] as $comp) {
if (!empty($url_parts[$comp])) {
break;
}
$url_parts[$comp] = $base_parts[$comp];
}
return build_url($url_parts);
}
Test
// Base URL
$base_url = 'https://example.com/path1/path2/path3/path4/file.ext?field1=value1&field2=value2#fragment';
// URL and URIs (_ is used to see what is coming from relative URL)
$test_urls = array(
"http://_example.com/_path1/_path2/_file.ext?_field1=_value1&_field2=_value2#_fragment", // URL
"//_example.com/_path1/_path2/_file.ext?_field1=_value1&_field2=_value2#_fragment", // URI without scheme
"//_example.com", // URI with host only
"/_path1/_path2/_file.ext?_field1=_value1&_field2=_value2#_fragment", // URI without scheme and host
"_path1/_path2/_file.ext", // URI with path only
"./../../_path1/../_path2/file.ext#_fragment", // URI with path and fragment
"?_field1=_value1&_field2=_value2#_fragment", // URI with query and fragment
"#_fragment" // URI with fragment only
);
// Expected result
$expected_urls = array(
"http://_example.com/_path1/_path2/_file.ext?_field1=_value1&_field2=_value2#_fragment",
"https://_example.com/_path1/_path2/_file.ext?_field1=_value1&_field2=_value2#_fragment",
"https://_example.com",
"https://example.com/_path1/_path2/_file.ext?_field1=_value1&_field2=_value2#_fragment",
"https://example.com/path1/path2/path3/path4/_path1/_path2/_file.ext",
"https://example.com/path1/path2/_path2/file.ext#_fragment",
"https://example.com/path1/path2/path3/path4/file.ext?_field1=_value1&_field2=_value2#_fragment",
"https://example.com/path1/path2/path3/path4/file.ext?field1=value1&field2=value2#_fragment"
);
foreach ($test_urls as $i => $url) {
$abs_url = abs_url($url, $base_url);
if ( $abs_url == $expected_urls[$i] ) {
echo "[OK] " . $abs_url . PHP_EOL;
} else {
echo "[WRONG] " . $abs_url . PHP_EOL;
}
}
Result
[OK] http://_example.com/_path1/_path2/_file.ext?_field1=_value1&_field2=_value2#_fragment
[OK] https://_example.com/_path1/_path2/_file.ext?_field1=_value1&_field2=_value2#_fragment
[OK] https://_example.com
[OK] https://example.com/_path1/_path2/_file.ext?_field1=_value1&_field2=_value2#_fragment
[OK] https://example.com/path1/path2/path3/path4/_path1/_path2/_file.ext
[OK] https://example.com/path1/path2/_path2/file.ext#_fragment
[OK] https://example.com/path1/path2/path3/path4/file.ext?_field1=_value1&_field2=_value2#_fragment
[OK] https://example.com/path1/path2/path3/path4/file.ext?field1=value1&field2=value2#_fragment
Wasn't the question in fact about converting a path, not a URL? PHP actually has a function for this: realpath(). The only thing you should be aware of is that it resolves symlinks.
Example from PHP manual:
chdir('/var/www/');
echo realpath('./../../etc/passwd') . PHP_EOL;
// Prints: /etc/passwd
echo realpath('/tmp/') . PHP_EOL;
// Prints: /tmp
An easy way to do this is to use phpUri, a small PHP library for converting relative URLs to absolute ones.
Usage is simple:
require_once 'phpuri.php';
$absolute = phpUri::parse( $base_path )->join( $relative_path );
You don't even need to check that the path passed to join is actually relative. If it is absolute then parse will return it.
If the relative directory already exists this will do the job:
function rel2abs($relPath, $baseDir = './')
{
if ('' == trim($relPath))
{
return $baseDir;
}
$currentDir = getcwd();
chdir($baseDir);
$path = realpath($relPath);
chdir($currentDir);
return $path;
}
I used the same code from: http://nashruddin.com/PHP_Script_for_Converting_Relative_to_Absolute_URL
but I modified it a little so that if the base URL contains a port number, the returned absolute URL includes that port number.
function rel2abs($rel, $base)
{
/* return if already absolute URL */
if (parse_url($rel, PHP_URL_SCHEME) != '') return $rel;
/* queries and anchors */
if ($rel[0]=='#' || $rel[0]=='?') return $base.$rel;
/* parse base URL and convert to local variables:
$scheme, $host, $path */
extract(parse_url($base));
/* remove non-directory element from path */
$path = preg_replace('#/[^/]*$#', '', $path);
/* destroy path if relative url points to root */
if ($rel[0] == '/') $path = '';
/* dirty absolute URL // with port number if exists */
if (parse_url($base, PHP_URL_PORT) != ''){
$abs = "$host:".parse_url($base, PHP_URL_PORT)."$path/$rel";
}else{
$abs = "$host$path/$rel";
}
/* replace '//' or '/./' or '/foo/../' with '/' */
$re = array('#(/\.?/)#', '#/(?!\.\.)[^/]+/\.\./#');
for($n=1; $n>0; $abs=preg_replace($re, '/', $abs, -1, $n)) {}
/* absolute URL is ready! */
return $scheme.'://'.$abs;
}
Hope this helps someone!
This function resolves relative URLs against a given current page URL in $pgurl without regex. It successfully resolves:
/home.php?example types,
same-dir nextpage.php types,
../...../.../parentdir types,
full http://example.net urls,
and shorthand //example.net urls
//Current base URL (you can dynamically retrieve from $_SERVER)
$pgurl = 'http://example.com/scripts/php/absurl.php';
function absurl($url) {
global $pgurl;
if(strpos($url,'://')) return $url; //already absolute
if(substr($url,0,2)=='//') return 'http:'.$url; //shorthand scheme
if($url[0]=='/') return parse_url($pgurl,PHP_URL_SCHEME).'://'.parse_url($pgurl,PHP_URL_HOST).$url; //just add domain
if(strpos($pgurl,'/',9)===false) $pgurl .= '/'; //add slash to domain if needed
return substr($pgurl,0,strrpos($pgurl,'/')+1).$url; //for relative links, gets current directory and appends new filename
}
function nodots($path) { //Resolve dot dot slashes, no regex!
$arr1 = explode('/',$path);
$arr2 = array();
foreach($arr1 as $seg) {
switch($seg) {
case '.':
break;
case '..':
array_pop($arr2);
break;
case '...':
array_pop($arr2); array_pop($arr2);
break;
case '....':
array_pop($arr2); array_pop($arr2); array_pop($arr2);
break;
case '.....':
array_pop($arr2); array_pop($arr2); array_pop($arr2); array_pop($arr2);
break;
default:
$arr2[] = $seg;
}
}
return implode('/',$arr2);
}
Usage Example:
echo nodots(absurl('../index.html'));
nodots() must be called after the URL is converted to absolute.
The dots function is kind of redundant, but is readable, fast, doesn't use regex's, and will resolve 99% of typical urls (if you want to be 100% sure, just extend the switch block to support 6+ dots, although I've never seen that many dots in a URL).
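If you do want to cover any number of dots without growing the switch block, a loop-based sketch along these lines should behave the same way for the cases above. The name nodots_any is made up for illustration. Keep in mind that RFC 3986 only defines "." and ".." as special segments, so treating "..." and longer runs as "go up several levels" is this answer's own convention, not standard URL behaviour.

```php
<?php
// Hypothetical generalisation of nodots(): a segment made of N dots
// pops N-1 previously collected segments (so "." pops 0 and ".." pops 1).
function nodots_any($path)
{
    $out = array();
    foreach (explode('/', $path) as $seg) {
        // Non-empty segment consisting only of dots?
        if ($seg !== '' && strspn($seg, '.') === strlen($seg)) {
            for ($i = 1; $i < strlen($seg); $i++) {
                array_pop($out);
            }
        } else {
            // Ordinary segment (including the empty one after "http:").
            $out[] = $seg;
        }
    }
    return implode('/', $out);
}
```

As with nodots(), it is meant to be called on a URL that is already absolute.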
Hope this helps,
function url_to_absolute($baseURL, $relativeURL) {
$relativeURL_data = parse_url($relativeURL);
if (isset($relativeURL_data['scheme'])) {
return $relativeURL;
}
$baseURL_data = parse_url($baseURL);
if (!isset($baseURL_data['scheme'])) {
return $relativeURL;
}
$absoluteURL_data = $baseURL_data;
if (isset($relativeURL_data['path']) && $relativeURL_data['path']) {
if (substr($relativeURL_data['path'], 0, 1) == '/') {
$absoluteURL_data['path'] = $relativeURL_data['path'];
} else {
$absoluteURL_data['path'] = (isset($absoluteURL_data['path']) ? preg_replace('#[^/]*$#', '', $absoluteURL_data['path']) : '/') . $relativeURL_data['path'];
}
if (isset($relativeURL_data['query'])) {
$absoluteURL_data['query'] = $relativeURL_data['query'];
} else if (isset($absoluteURL_data['query'])) {
unset($absoluteURL_data['query']);
}
} else {
$absoluteURL_data['path'] = isset($absoluteURL_data['path']) ? $absoluteURL_data['path'] : '/';
if (isset($relativeURL_data['query'])) {
$absoluteURL_data['query'] = $relativeURL_data['query'];
} else if (isset($absoluteURL_data['query'])) {
$absoluteURL_data['query'] = $absoluteURL_data['query'];
}
}
if (isset($relativeURL_data['fragment'])) {
$absoluteURL_data['fragment'] = $relativeURL_data['fragment'];
} else if (isset($absoluteURL_data['fragment'])) {
unset($absoluteURL_data['fragment']);
}
$absoluteURL_path = ltrim($absoluteURL_data['path'], '/');
$absoluteURL_path_parts = array();
for ($i = 0, $i2 = 0; $i < strlen($absoluteURL_path); $i++) {
if (isset($absoluteURL_path_parts[$i2])) {
$absoluteURL_path_parts[$i2] .= $absoluteURL_path[$i];
} else {
$absoluteURL_path_parts[$i2] = $absoluteURL_path[$i];
}
if ($absoluteURL_path[$i] == '/') {
$i2++;
}
}
reset($absoluteURL_path_parts);
while (true) {
if (rtrim(current($absoluteURL_path_parts), '/') == '.') {
unset($absoluteURL_path_parts[key($absoluteURL_path_parts)]);
continue;
} else if (rtrim(current($absoluteURL_path_parts), '/') == '..') {
if (prev($absoluteURL_path_parts) !== false) {
unset($absoluteURL_path_parts[key($absoluteURL_path_parts)]);
} else {
reset($absoluteURL_path_parts);
}
unset($absoluteURL_path_parts[key($absoluteURL_path_parts)]);
continue;
}
if (next($absoluteURL_path_parts) === false) {
break;
}
}
$absoluteURL_data['path'] = '/' . implode('', $absoluteURL_path_parts);
$absoluteURL = isset($absoluteURL_data['scheme']) ? $absoluteURL_data['scheme'] . ':' : '';
$absoluteURL .= (isset($absoluteURL_data['user']) || isset($absoluteURL_data['host'])) ? '//' : '';
$absoluteURL .= isset($absoluteURL_data['user']) ? $absoluteURL_data['user'] : '';
$absoluteURL .= isset($absoluteURL_data['pass']) ? ':' . $absoluteURL_data['pass'] : '';
$absoluteURL .= isset($absoluteURL_data['user']) ? '@' : '';
$absoluteURL .= isset($absoluteURL_data['host']) ? $absoluteURL_data['host'] : '';
$absoluteURL .= isset($absoluteURL_data['port']) ? ':' . $absoluteURL_data['port'] : '';
$absoluteURL .= isset($absoluteURL_data['path']) ? $absoluteURL_data['path'] : '';
$absoluteURL .= isset($absoluteURL_data['query']) ? '?' . $absoluteURL_data['query'] : '';
$absoluteURL .= isset($absoluteURL_data['fragment']) ? '#' . $absoluteURL_data['fragment'] : '';
return $absoluteURL;
}
You can use this composer package to do that.
https://packagist.org/packages/wa72/url
composer require wa72/url
Parse URL strings to objects
add and modify query parameters
set and modify any part of the url
test for equality of URLs with query parameters in a PHP-fashioned way
supports protocol-relative urls
convert absolute, host-relative and protocol-relative urls to relative and vice versa
This is a tweak of @jordansstephens's answer, which doesn't support absolute URLs beginning with '//'.
function rel2abs($rel, $base)
{
/* return if already absolute URL */
if (parse_url($rel, PHP_URL_SCHEME) != '') return $rel;
/* Url begins with // */
if($rel[0] == '/' && $rel[1] == '/'){
return 'https:' . $rel;
}
/* queries and anchors */
if ($rel[0]=='#' || $rel[0]=='?') return $base.$rel;
/* parse base URL and convert to local variables:
$scheme, $host, $path */
extract(parse_url($base));
/* remove non-directory element from path */
$path = preg_replace('#/[^/]*$#', '', $path);
/* destroy path if relative url points to root */
if ($rel[0] == '/') $path = '';
/* dirty absolute URL */
$abs = "$host$path/$rel";
/* replace '//' or '/./' or '/foo/../' with '/' */
$re = array('#(/\.?/)#', '#/(?!\.\.)[^/]+/\.\./#');
for($n=1; $n>0; $abs=preg_replace($re, '/', $abs, -1, $n)) {}
/* absolute URL is ready! */
return $scheme.'://'.$abs;
}

Replacing last char (string) using regex or DOMDocument

I'm using one small script to convert from absolute links to relative ones. It is working but it needs improvement. Not sure how to proceed. Please have a look at part of the script used for this.
Script:
public function links($path) {
$old_url = 'http://test.dev/';
$dir_handle = opendir($path);
while($item = readdir($dir_handle)) {
$new_path = $path."/".$item;
if(is_dir($new_path) && $item != '.' && $item != '..') {
$this->links($new_path);
}
// it is a file
else{
if($item != '.' && $item != '..')
{
$new_url = '';
$depth_count = 1;
$folder_depth = substr_count($new_path, '/');
while($depth_count < $folder_depth){
$new_url .= '../';
$depth_count++;
}
$file_contents = file_get_contents($new_path);
$doc = new DOMDocument;
@$doc->loadHTML($file_contents);
foreach ($doc->getElementsByTagName('a') as $link) {
if (substr($link, -1) == "/"){
$link->setAttribute('href', $link->getAttribute('href').'/index.html');
}
}
$doc->saveHTML();
$file_contents = str_replace($old_url,$new_url,$file_contents);
file_put_contents($new_path,$file_contents);
}
}
}
}
As you can see, I've added DOMDocument inside the while loop, but it doesn't work. What I'm trying to achieve here is to append index.html to every link whose last character is /
What am I doing wrong?
Thank you.
Is this what you want?
$file_contents = file_get_contents($new_path);
$dom = new DOMDocument();
$dom->loadHTML($file_contents);
$xpath = new DOMXPath($dom);
$links = $xpath->query("//a");
foreach ($links as $link) {
$href = $link->getAttribute('href');
if (substr($href, -1) === '/') {
$link->setAttribute('href', $href."index.html");
}
}
$new_file_content = $dom->saveHTML();
# save this wherever you want
See a demo on ideone.com.
Hint: Your call to $dom->saveHTML() leads to nowhere (ie there's no variable capturing the output).

The PHP crawler I am using has a memory leak, what is causing this?

I am using a PHP crawler that has a memory leak. It works for the first ~3125 links, then it runs out of memory. I tried getting rid of the MySQL insert, but that did not change anything. Can someone help me diagnose this problem? Thank you so much.
<?php
include $_SERVER['DOCUMENT_ROOT'] . '/config.php';
ini_set('max_execution_time', 0);
// USAGE
$startURL = $your_url;
$depth = 9999;
$crawler = new crawler($startURL, $depth);
// Exclude path with the following structure to be processed
$crawler->addFilterPath('customer/account/login/referer');
$crawler->run();
class crawler
{
protected $_url;
protected $_depth;
protected $_host;
protected $_seen = array();
protected $_filter = array();
public function __construct($url, $depth = 5)
{
$this->_url = $url;
$this->_depth = $depth;
$parse = parse_url($url);
$this->_host = $parse['host'];
}
protected function _processAnchors($content, $url, $depth)
{
$dom = new DOMDocument('1.0');
@$dom->loadHTML($content);
$anchors = $dom->getElementsByTagName('a');
foreach ($anchors as $element) {
$href = $element->getAttribute('href');
if (0 !== strpos($href, 'http')) {
$path = '/' . ltrim($href, '/');
if (extension_loaded('http')) {
$href = http_build_url($url, array('path' => $path));
} else {
$parts = parse_url($url);
$href = $parts['scheme'] . '://';
if (isset($parts['user']) && isset($parts['pass'])) {
$href .= $parts['user'] . ':' . $parts['pass'] . '@';
}
$href .= $parts['host'];
if (isset($parts['port'])) {
$href .= ':' . $parts['port'];
}
$href .= $path;
}
}
// Crawl only link that belongs to the start domain
$this->crawl_page($href, $depth - 1);
}
}
protected function _getContent($url)
{
$handle = curl_init($url);
curl_setopt($handle, CURLOPT_RETURNTRANSFER, TRUE);
$response = curl_exec($handle);
curl_close($handle);
return array($response);
}
protected function _printResult($url, $depth)
{
ob_end_flush();
$currentDepth = $this->_depth - $depth;
$count = count($this->_seen);
echo "$url <br>";
include $_SERVER['DOCUMENT_ROOT'] . '/config.php';
$databaseconnect = new PDO("mysql:dbname=DB_NAME;host=$mysqlhost;charset=utf8", $mysqlusername, $mysqlpassword);
$statement = $databaseconnect->prepare("INSERT INTO data(url,name) VALUES(:url,:name)");
$statement->execute(array(':url' => $url,
':name' => $url));
ob_start();
flush();
}
protected function isValid($url, $depth)
{
if (strpos($url, $this->_host) === false
|| $depth === 0
|| isset($this->_seen[$url])
) {
return false;
}
foreach ($this->_filter as $excludePath) {
if (strpos($url, $excludePath) !== false) {
return false;
}
}
return true;
}
public function crawl_page($url, $depth)
{
if (!$this->isValid($url, $depth)) {
return;
}
// add to the seen URL
$this->_seen[$url] = true;
// get Content and Return Code
list($content) = $this->_getContent($url);
// print Result for current Page
$this->_printResult($url, $depth);
// process subPages
$this->_processAnchors($content, $url, $depth);
}
public function addFilterPath($path)
{
$this->_filter[] = $path;
}
public function run()
{
$this->crawl_page($this->_url
, $this->_depth);
}
}
?>
I'm not sure if this classifies as a memory leak exactly. You are essentially using recursion without a terminating case. Before the crawl_page() method finishes it calls _processAnchors(), which in turn may call crawl_page() again if it finds any links (very likely). Every recursive call eats up more memory because the originating crawl_page() call (and most thereafter) can't be removed from the call stack until all of its recursive calls terminate.
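One way to avoid the deep call stack is to replace the recursion with an explicit queue, so each page is finished before the next one starts. The sketch below is only an illustration of that idea, not a drop-in replacement for the class above. The function name crawl_iterative and the $fetchLinks callable (which takes a URL and returns the absolute links found on that page) are assumptions of mine.

```php
<?php
// Breadth-first crawl using a queue instead of recursion.
// $fetchLinks is a hypothetical callable: URL in, array of absolute URLs out.
function crawl_iterative($startUrl, $maxDepth, callable $fetchLinks)
{
    $seen = array($startUrl => true);   // URLs already queued
    $visited = array();                 // URLs in the order they were processed
    $queue = new SplQueue();
    $queue->enqueue(array($startUrl, 0));

    while (!$queue->isEmpty()) {
        list($url, $depth) = $queue->dequeue();
        $visited[] = $url;
        if ($depth >= $maxDepth) {
            continue; // do not expand pages at the depth limit
        }
        foreach ($fetchLinks($url) as $link) {
            if (!isset($seen[$link])) {
                $seen[$link] = true;
                $queue->enqueue(array($link, $depth + 1));
            }
        }
    }
    return $visited;
}
```

With this shape, memory use is driven by the queue and the $seen array rather than by nested stack frames that each hold their own DOMDocument and page content.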

PHP convert external relative path to absolute path

I am trying to figure out how to convert an "external relative path" to an absolute one:
I'd really like a function that will do the following:
$path = "/search?q=query";
$host = "http://google.com";
$abspath = reltoabs($host, $path);
And have $abspath equal to "http://google.com/search?q=query"
Another example:
$path = "top.html";
$host = "www.example.com/documentation";
$abspath = reltoabs($host, $path);
And have $abspath equal to "http://www.example.com/documentation/top.html"
The problem is that it is not guaranteed to be in that format, and it could already be absolute, or be pointing to a different host entirely, and I'm not quite sure how to approach this.
Thanks.
You should try the PECL function http_build_url
http://php.net/manual/en/function.http-build-url.php
So there are three cases:
proper URL
no protocol
no protocol and no domain
Example code (untested):
if (preg_match('#^http(?:s)?://#i', $userurl))
$url = preg_replace('#^http(s)?://#i', 'http$1://', $userurl); //protocol lowercase
//deem to have domain if a dot is found before a /
elseif (preg_match('#^[^/]+\\.[^/]+#', $userurl))
$url = "http://".$userurl;
else { //no protocol or domain
$url = "http://default.domain" . (($userurl[0] != "/") ? "/" : "") . $userurl;
}
$url = filter_var($url, FILTER_VALIDATE_URL);
if ($url === false)
die("User gave invalid url");
It appears I have solved my own problem:
function reltoabs($host, $path) {
$resulting = array();
$hostparts = parse_url($host);
$pathparts = parse_url($path);
if (array_key_exists("host", $pathparts)) return $path; // Absolute
// Relative
$opath = "";
if (array_key_exists("scheme", $hostparts)) $opath .= $hostparts["scheme"] . "://";
if (array_key_exists("user", $hostparts)) {
if (array_key_exists("pass", $hostparts)) $opath .= $hostparts["user"] . ":" . $hostparts["pass"] . "@";
else $opath .= $hostparts["user"] . "@";
} elseif (array_key_exists("pass", $hostparts)) $opath .= ":" . $hostparts["pass"] . "@";
if (array_key_exists("host", $hostparts)) $opath .= $hostparts["host"];
if (!array_key_exists("path", $pathparts) || $pathparts["path"][0] != "/") {
$dirname = explode("/", $hostparts["path"]);
$opath .= implode("/", array_slice($dirname, 0, count($dirname) - 1)) . "/" . basename($pathparts["path"]);
} else $opath .= $pathparts["path"];
if (array_key_exists("query", $pathparts)) $opath .= "?" . $pathparts["query"];
if (array_key_exists("fragment", $pathparts)) $opath .= "#" . $pathparts["fragment"];
return $opath;
}
Which seems to work pretty well, for my purposes.
