Here is my code, which is partially based on a few snippets that are easy to find in various places if you Google around. I'm trying to count the internal links, external links, all links, and (TO DO) rel="nofollow" links on any webpage. This is what I have so far. Most of the results are correct, though some generic calls give me weird results, and I still need to handle rel="nofollow" and perhaps target="_blank" as well. If you care to comment or add/change anything with a bit of explanation of the logic, please do; it will be much appreciated.
<?php
// transform to absolute path function...
function path_to_absolute($rel, $base)
{
/* return if already absolute URL */
if (parse_url($rel, PHP_URL_SCHEME) != '') return $rel;
/* queries and anchors */
if ($rel[0]=='#' || $rel[0]=='?') return $base.$rel;
/* parse base URL and convert to local variables:
$scheme, $host, $path */
extract(parse_url($base));
/* remove non-directory element from path */
$path = preg_replace('#/[^/]*$#', '', $path);
/* destroy path if relative url points to root */
if ($rel[0] == '/') $path = '';
/* dirty absolute URL */
$abs = "$host$path/$rel";
/* replace '//' or '/./' or '/foo/../' with '/' */
$re = array('#(/\.?/)#', '#/(?!\.\.)[^/]+/\.\./#');
for($n=1; $n>0; $abs=preg_replace($re, '/', $abs, -1, $n)) {}
/* absolute URL is ready! */
return $scheme.'://'.$abs;
}
// count zero begins
$intnumLinks = 0;
$extnumLinks = 0;
$nfnumLinks = 0;
$allnumLinks = 0;
// get url (note: unvalidated user input; fine for a local tool, unsafe on a public server)
$url = $_REQUEST['url'];
// get contents of the url
$html = file_get_contents($url);
// http://stackoverflow.com/questions/138313/how-to-extract-img-src-title-and-alt-from-html-using-php
// loading DOM document
$doc=new DOMDocument();
@$doc->loadHTML($html); // '@' suppresses warnings from malformed real-world HTML
$xml=simplexml_import_dom($doc); // just to make xpath more simple
$strings=$xml->xpath('//a');
foreach ($strings as $string) {
$aa = path_to_absolute($string['href'], $url); // quote the array key; the function takes two arguments
$a = parse_url($aa, PHP_URL_HOST);
$a = str_replace("www.", "", $a);
$b = parse_url($url, PHP_URL_HOST);
$type = ($a == $b) ? 'int' : 'ext';
echo 'call-host: ' . $b . '<br>';
echo 'type: ' . $type . '<br>';
echo 'title: ' . $string[0] . '<br>';
echo 'url: ' . $string['href'] . '<br>';
echo 'host: ' . $a . '<br><br>';
if ($type == 'int') { $intnumLinks++; } else { $extnumLinks++; }
$allnumLinks++;
}
// count results
echo "<br>";
echo "Count int: $intnumLinks <br>";
echo "Count ext: $extnumLinks <br>";
echo "Count nf: $nfnumLinks <br>";
echo "Count all: $allnumLinks <br>";
?>
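For the rel="nofollow" and target="_blank" TO DO, here is a minimal self-contained sketch (the sample HTML is inline for illustration; in the script above you would feed it the fetched $html instead). It uses DOMDocument directly and treats rel as a space-separated token list, since rel can carry several values at once:

```php
// Sample HTML; in the link counter above you would use the fetched page instead.
$html = '<a href="/a" rel="nofollow">a</a>'
      . '<a href="/b" target="_blank">b</a>'
      . '<a href="/c" rel="NOFOLLOW noopener">c</a>';

$doc = new DOMDocument();
@$doc->loadHTML($html);            // '@' silences warnings on messy real-world markup

$nfnumLinks = 0;
$blanknumLinks = 0;
foreach ($doc->getElementsByTagName('a') as $a) {
    // rel may hold several space-separated tokens, e.g. "nofollow noopener"
    $tokens = preg_split('/\s+/', strtolower($a->getAttribute('rel')));
    if (in_array('nofollow', $tokens, true)) {
        $nfnumLinks++;
    }
    if (strtolower($a->getAttribute('target')) === '_blank') {
        $blanknumLinks++;
    }
}

echo "Count nf: $nfnumLinks <br>";        // Count nf: 2
echo "Count _blank: $blanknumLinks <br>"; // Count _blank: 1
```

The token-list check matters because a plain substring test would also match values like "nofollowed".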
Consider this post closed. At first I wanted to delete it, but then again someone might find this code useful for their own work.
Related
I have the following code :
function removeFilename($url)
{
$file_info = pathinfo($url);
return isset($file_info['extension'])
? str_replace($file_info['filename'] . "." . $file_info['extension'], "", $url)
: $url;
}
$url1 = "http://website.com/folder/filename.php";
$url2 = "http://website.com/folder/";
$url3 = "http://website.com/";
echo removeFilename($url1); //outputs http://website.com/folder/
echo removeFilename($url2);//outputs http://website.com/folder/
echo removeFilename($url3);//outputs http:///
Now my problem is that when there is only a domain, with no folders or filenames, my function removes website.com too.
My idea is to tell the function to only operate after the third slash; is there a way to do that in PHP, or do you have any other useful solution?
UPDATED : ( working and tested )
<?php
function removeFilename($url)
{
$parse_file = parse_url($url);
$file_info = pathinfo($parse_file['path']);
return isset($file_info['extension'])
? str_replace($file_info['filename'] . "." . $file_info['extension'], "", $url)
: $url;
}
$url1 = "http://website.com/folder/filename.com";
$url2 = "http://website.org/folder/";
$url3 = "http://website.com/";
echo removeFilename($url1); echo '<br/>';
echo removeFilename($url2); echo '<br/>';
echo removeFilename($url3);
?>
Output:
http://website.com/folder/
http://website.org/folder/
http://website.com/
Sounds like you want to replace a substring, not the whole thing. This function might help you:
http://php.net/manual/en/function.substr-replace.php
Since the filename comes after the last slash, you can use substr and str_replace to strip it from the path.
$PATH = "http://website.com/folder/filename.php";
$file = substr( strrchr( $PATH, "/" ), 1) ;
echo $dir = str_replace( $file, '', $PATH ) ;
OUTPUT
http://website.com/folder/
pathinfo can't tell a bare domain from a file name on its own. But this works if URLs without a filename end with a slash:
$a = array(
"http://website.com/folder/filename.php",
"http://website.com/folder/",
"http://website.com",
);
foreach ($a as $item) {
$item = explode('/', $item);
if (count($item) > 3)
$item[count($item)-1] = '';
echo implode('/', $item) . "\n";
}
result
http://website.com/folder/
http://website.com/folder/
http://website.com
This is close to splash58's answer:
function getPath($url) {
$item = explode('/', $url);
if (count($item) > 3) {
if (strpos($item[count($item) - 1], ".") === false) {
return $url;
}
$item[count($item)-1] ='';
return implode('/', $item);
}
return $url;
}
Is there any PHP function to sanitize a link+path?
i.e.
http://example.com/fold1/fold2/fold3/../../././MyFile.HTML
to
http://example.com/fold1/MyFile.HTML
So I want to remove the dots but maintain the correct (relative) path.
What I've found so far is:
echo ConvertDotedPathToNormalUrl('http://example.com/directory/.././pageee.html');
code:
function ConvertDotedPathToNormalUrl($url){
$firstType = '/(.*)\/((?:(?!\.\.).)+)\/\.\.\//si';
preg_match($firstType,$url,$result);
if (!empty($result[2])){
$url = str_replace('/'.$result[2].'/..','',$url);
if ( strstr($url,'../')){$url= ConvertDotedPathToNormalUrl($url);}
}
$url = str_replace('/./', '/', $url);
$url = str_replace('://', '|||', $url);   // protect the scheme's double slash
$url = str_replace('//', '/', $url);
$url = str_replace('|||', '://', $url);
return $url;
}
P.S. It still doesn't handle every case it converts, though.
You can
1) get the $path using parse_url(..).
2) get the $webroot = $_SERVER['DOCUMENT_ROOT'];
3) get the $zrealpath = realpath($webroot . $path);
<?php
define ('CRLF', "<br />\n");
$url = 'http://example.com/fold1/fold2/fold3/../../././MyFile.HTML';
$parsed = parse_url($url);
echo '---- vardump($parsed):', CRLF; // for education
zvardump($parsed);
$webroot = $_SERVER['DOCUMENT_ROOT'];
echo 'webroot = ', $webroot, CRLF;
$path = $parsed['path'];
echo 'path = ', $path, CRLF;
$zrealpath = realpath($webroot . $path);
echo 'realpath = ', $zrealpath, CRLF;
function zvardump($var1) {
ob_start();
echo "<pre style=\"margin:0;\">\n";
var_dump($var1);
echo "</pre>\n";
$zoutput = ob_get_contents();
ob_end_clean();
echo str_replace("=>\n ", " => ", $zoutput);
}
?>
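One caveat with the realpath() approach above: it only resolves paths that actually exist under the local DOCUMENT_ROOT, so it cannot normalize URLs pointing at other servers. A pure-string sketch that removes "." and ".." segments in the spirit of RFC 3986 (the function name normalize_url_path is made up here, and query strings and fragments are dropped for brevity):

```php
// String-only dot-segment removal (in the spirit of RFC 3986, section 5.2.4).
// No filesystem access needed, so it also works for URLs on other servers.
function normalize_url_path($url) {
    $parts = parse_url($url);
    $segments = explode('/', isset($parts['path']) ? $parts['path'] : '/');
    $out = array();
    foreach ($segments as $seg) {
        if ($seg === '' || $seg === '.') {
            continue;                  // skip empty ("//") and "." segments
        }
        if ($seg === '..') {
            array_pop($out);           // ".." climbs one directory up
            continue;
        }
        $out[] = $seg;
    }
    return $parts['scheme'] . '://' . $parts['host'] . '/' . implode('/', $out);
}

echo normalize_url_path('http://example.com/fold1/fold2/fold3/../../././MyFile.HTML');
// http://example.com/fold1/MyFile.HTML
```

Unlike the recursive regex version, this walks the path once with a stack, so it handles any nesting of "../" without re-scanning the string.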
With PHP code I can scrape the title and URL from Google search results; now how do I get the descriptions?
$url = 'http://www.google.com/search?hl=en&safe=active&tbo=d&site=&source=hp&q=Beautiful+Bangladesh&oq=Beautiful+Bangladesh';
$html = file_get_html($url);
$linkObjs = $html->find('h3.r a');
foreach ($linkObjs as $linkObj) {
$title = trim($linkObj->plaintext);
$link = trim($linkObj->href);
// if it is not a direct link but url reference found inside it, then extract
if (!preg_match('/^https?/', $link) && preg_match('/q=(.+)&sa=/U', $link, $matches) && preg_match('/^https?/', $matches[1])) {
$link = $matches[1];
} else if (!preg_match('/^https?/', $link)) { // skip if it is not a valid link
continue;
}
echo '<p>Title: ' . $title . '<br />';
echo 'Link: ' . $link . '</p>';
}
The above code gives the following output
Title: Natural Beauties - Bangladesh Photo Gallery
Link: http://www.photo.com.bd/Beauties/
Now I want the following output
Title: Natural Beauties - Bangladesh Photo Gallery
Link: http://www.photo.com.bd/Beauties/
description : photo.com.bd is a website for creative photographers from Bangladesh, mainly for amateur ... Natural-Beauty-of-Bangladesh_Flower · fishing on ... BEAUTY-4.
include("simple_html_dom.php");
$in = "Beautiful Bangladesh";
$in = str_replace(' ','+',$in); // space is a +
$url = 'http://www.google.com/search?hl=en&tbo=d&site=&source=hp&q='.$in.'&oq='.$in.'';
print $url."<br>";
$html = file_get_html($url);
$i=0;
$linkObjs = $html->find('h3.r a');
foreach ($linkObjs as $linkObj) {
$title = trim($linkObj->plaintext);
$link = trim($linkObj->href);
// if it is not a direct link but url reference found inside it, then extract
if (!preg_match('/^https?/', $link) && preg_match('/q=(.+)&sa=/U', $link, $matches) && preg_match('/^https?/', $matches[1])) {
$link = $matches[1];
} else if (!preg_match('/^https?/', $link)) { // skip if it is not a valid link
continue;
}
$descr = $html->find('span.st', $i); // the description is not a child element of the H3, therefore we use a counter and fetch it separately
$i++;
echo '<p>Title: ' . $title . '<br />';
echo 'Link: ' . $link . '<br />';
echo 'Description: ' . $descr . '</p>';
}
I am trying to use $value inside the $feed_title variable, and to generate all 200 $feed_title variables.
What I am trying to accomplish would look like this:
Feed Url: http://something.com/term/###/feed
Feed Title: Some Title
Where the ### varies from 100-300.
I am using the following code, and getting the urls, but not sure how to get the titles for each feed:
$arr = range(100,300);
foreach($arr as $key=>$value)
{
unset($arr[$key + 1]);
$feed_title = simplexml_load_file('http://www.something.com/term/'
. ??? . '/0/feed');
echo 'Feed URL: <a href="http://www.something.com/term/' . $value
. '/0/feed">http://www.something.com//term/' . $value
. '/0/feed</a><br/> Feed Category: ' . $feed_title->channel[0]->title
. '<br/>';
}
Do I need another loop inside of the foreach? Any help is appreciated.
If you want to get the title of a page, use this function:
function getTitle($Url){
$str = file_get_contents($Url);
if (strlen($str) > 0) {
// 'si' flags: case-insensitive match that also allows the title to span lines
preg_match("/<title>(.*?)<\/title>/si", $str, $title);
return isset($title[1]) ? trim($title[1]) : '';
}
}
}
Here's some sample code:
<?php
function getTitle($Url){
$str = file_get_contents($Url);
if (strlen($str) > 0) {
// 'si' flags: case-insensitive match that also allows the title to span lines
preg_match("/<title>(.*?)<\/title>/si", $str, $title);
return isset($title[1]) ? trim($title[1]) : '';
}
}
}
$arr = range(300,305);
foreach($arr as $value)
{
$feed_title = getTitle('http://www.translate.com/portuguese/feed/' . $value);
echo 'Feed URL: http://www.translate.com/portuguese/feed/' . $value . '<br/>
Feed Category: ' . $feed_title . '<br/>';
}
?>
This gets the title from translate.com pages. I just limited the number of pages for faster execution.
Just change the getTitle to your function if you want to get the title from xml.
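If the feed really is XML, a sketch that reads the channel title via SimpleXML instead of regexing the HTML (an inline XML string stands in for the feed here; with a real feed you would use simplexml_load_file as in the question):

```php
// Read an RSS channel title from an XML string via SimpleXML.
// Inline sample; swap simplexml_load_string for simplexml_load_file($url) in practice.
$xml = '<?xml version="1.0"?><rss version="2.0"><channel><title>Some Title</title></channel></rss>';

$feed = simplexml_load_string($xml);
echo $feed->channel[0]->title, "\n";   // Some Title
```

This matches the `$feed_title->channel[0]->title` access pattern from the question, which works once the document is parsed as XML rather than scraped as text.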
Instead of using an array created with range, use a for loop as follows:
for($i = 100; $i <= 300; $i++){
$feed = simplexml_load_file('http://www.something.com/term/' . $i . '/0/feed');
echo 'Feed URL: http://www.something.com/term/' . $i . '/0/feed/ <br /> Feed category: ' . $feed->channel[0]->title . '<br/>';
}
The variable ' $return10 ' (for example) is a url, and I need to append ' &var2=example ' to the end. Like this:
header( "Location: $return10&var2=example" );
header ("Content-Length: 0");
exit;
The challenge is not knowing if the url contained in ' $return10 ' will already have a query string.
Choice A) If I use ' &var2=example ' , then sometimes the final url will be ' ://example.com&var2=example ' , with no '?' to start the query string.
Choice B) If I use ' ?var2=example ', then sometimes the final url will contain two '?' characters, starting two different query strings.
Is there a third choice? How would you cover both possibilities using "the correct code?" Thank you.
Create a function that will append your query code if there is one... And add it if there isn't...
function append_query($url, $query) {
// Fix for relative scheme URL
$relativeScheme = false;
if(substr($url, 0, 3) == '://') {
$relativeScheme = true;
$url = 'a' . $url;
}
$newUrl = http_build_url($url, array('query' => $query), HTTP_URL_JOIN_QUERY);
if($relativeScheme) {
return substr($newUrl, 1);
}
return $newUrl;
}
header('Location: ' . append_query($return10, 'var2=example'));
This will work regardless of whether your URL has a fragment or not.
EDIT: Fixed for relative scheme URL.
If your PHP does not have http_build_url() available (i.e. the PECL extension is not installed), here is a pure-PHP version of it which does not require the extension.
define('HTTP_URL_REPLACE', 1); // Replace every part of the first URL when there's one of the second URL
define('HTTP_URL_JOIN_PATH', 2); // Join relative paths
define('HTTP_URL_JOIN_QUERY', 4); // Join query strings
define('HTTP_URL_STRIP_USER', 8); // Strip any user authentication information
define('HTTP_URL_STRIP_PASS', 16); // Strip any password authentication information
define('HTTP_URL_STRIP_AUTH', 32); // Strip any authentication information
define('HTTP_URL_STRIP_PORT', 64); // Strip explicit port numbers
define('HTTP_URL_STRIP_PATH', 128); // Strip complete path
define('HTTP_URL_STRIP_QUERY', 256); // Strip query string
define('HTTP_URL_STRIP_FRAGMENT', 512); // Strip any fragments (#identifier)
define('HTTP_URL_STRIP_ALL', 1024); // Strip anything but scheme and host
// Build a URL
// The parts of the second URL will be merged into the first according to the flags argument.
//
// @param mixed (Part(s) of) a URL in the form of a string or an associative array as parse_url() returns
// @param mixed Same as the first argument
// @param int A bitmask of binary-or'ed HTTP_URL constants (optional); HTTP_URL_REPLACE is the default
// @param array If set, it will be filled with the parts of the composed URL as parse_url() would return them
function http_build_url($url, $parts = array (), $flags = HTTP_URL_REPLACE, &$new_url = false) {
$keys = array (
'user',
'pass',
'port',
'path',
'query',
'fragment'
);
// HTTP_URL_STRIP_ALL becomes all the HTTP_URL_STRIP_Xs
if ($flags & HTTP_URL_STRIP_ALL) {
$flags |= HTTP_URL_STRIP_USER;
$flags |= HTTP_URL_STRIP_PASS;
$flags |= HTTP_URL_STRIP_PORT;
$flags |= HTTP_URL_STRIP_PATH;
$flags |= HTTP_URL_STRIP_QUERY;
$flags |= HTTP_URL_STRIP_FRAGMENT;
}
// HTTP_URL_STRIP_AUTH becomes HTTP_URL_STRIP_USER and HTTP_URL_STRIP_PASS
else if ($flags & HTTP_URL_STRIP_AUTH) {
$flags |= HTTP_URL_STRIP_USER;
$flags |= HTTP_URL_STRIP_PASS;
}
// Parse the original URL
$parse_url = parse_url($url);
// Scheme and Host are always replaced
if (isset($parts['scheme']))
$parse_url['scheme'] = $parts['scheme'];
if (isset($parts['host']))
$parse_url['host'] = $parts['host'];
// (If applicable) Replace the original URL with its new parts
if ($flags & HTTP_URL_REPLACE) {
foreach ($keys as $key) {
if (isset($parts[$key]))
$parse_url[$key] = $parts[$key];
}
} else {
// Join the original URL path with the new path
if (isset($parts['path']) && ($flags & HTTP_URL_JOIN_PATH)) {
if (isset($parse_url['path']))
$parse_url['path'] = rtrim(str_replace(basename($parse_url['path']), '', $parse_url['path']), '/') . '/' . ltrim($parts['path'], '/');
else
$parse_url['path'] = $parts['path'];
}
// Join the original query string with the new query string
if (isset($parts['query']) && ($flags & HTTP_URL_JOIN_QUERY)) {
if (isset($parse_url['query']))
$parse_url['query'] .= '&' . $parts['query'];
else
$parse_url['query'] = $parts['query'];
}
}
// Strips all the applicable sections of the URL
// Note: Scheme and Host are never stripped
foreach ($keys as $key) {
if ($flags & (int)constant('HTTP_URL_STRIP_' . strtoupper($key)))
unset($parse_url[$key]);
}
$new_url = $parse_url;
return ((isset($parse_url['scheme'])) ? $parse_url['scheme'] . '://' : '')
. ((isset($parse_url['user'])) ? $parse_url['user'] . ((isset($parse_url['pass'])) ? ':' . $parse_url['pass'] : '') . '@' : '') // '@' separates userinfo from host
. ((isset($parse_url['host'])) ? $parse_url['host'] : '')
. ((isset($parse_url['port'])) ? ':' . $parse_url['port'] : '')
. ((isset($parse_url['path'])) ? $parse_url['path'] : '')
. ((isset($parse_url['query'])) ? '?' . $parse_url['query'] : '')
. ((isset($parse_url['fragment'])) ? '#' . $parse_url['fragment'] : '');
}
Take a look at http://php.net/manual/en/function.parse-url.php and http://www.php.net/manual/en/function.http-build-url.php.
Now you can do something like this:
<?php
$return10 = '... some url here ...';
$newUrl = http_build_url(
$return10,
array('query' => 'var2=example'),
HTTP_URL_JOIN_QUERY
);
?>
You can construct the URI separately. Note that strpos() must be compared with === false, because a '?' at position 0 would otherwise count as "not found", and the '?' branch must fire when there is no query string yet:
if (strpos($uri, "?") === false) {
    $uri .= "?var2=example";
} else {
    $uri .= "&var2=example";
}
header("Location: $uri");
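For completeness, a small sketch of the same idea using parse_url(), which sidesteps the strpos() position-0 pitfall entirely (the function name append_query_param is made up for this example; fragments are still not handled):

```php
// Choose '?' vs '&' by asking parse_url() whether a query already exists.
function append_query_param($url, $query) {
    // PHP_URL_QUERY yields null when the URL has no query string yet
    $sep = (parse_url($url, PHP_URL_QUERY) === null) ? '?' : '&';
    return $url . $sep . $query;
}

echo append_query_param('http://example.com/page', 'var2=example'), "\n";
// http://example.com/page?var2=example
echo append_query_param('http://example.com/page?a=1', 'var2=example'), "\n";
// http://example.com/page?a=1&var2=example
```

Usage would be header('Location: ' . append_query_param($return10, 'var2=example')); if your URLs can carry a #fragment, strip it first and re-append it after the new parameter.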