Working with links: identifying external links and the full address of links - PHP

I'm trying to create a sitemap for my website.
So basically I scan the homepage for links,
extract the links, and do the same thing recursively for the extracted links:
function get_contents($url = '') {
    if ($url == '') { $url = $this->base_url; }
    $curl = new cURL;
    $content = $curl->get($url);
    $this->get_links($content);
}

public function get_links($contents) {
    $DOM = new DOMDocument();
    $DOM->loadHTML($contents);
    $a = $DOM->getElementsByTagName('a');
    foreach ($a as $link) {
        $h = $link->getAttribute('href');
        $l = $this->base_url . '/' . $h;
        $this->links[] = $l;
        $this->get_contents($l);
    }
}
It works fine, but there are a couple of problems.
1 -
I get some links like
www.mysite.com/http://www.external.com
I can do something like:

if (stripos($link, 'http') !== false
    || stripos($link, 'www.') !== false
    || stripos($link, 'https') !== false)
{
    if (stripos($link, 'mysite.com') !== false)
    {
        // ignore this link (yeah, I suck at regex and string matching)
    }
}
but this seems very complicated and slow. Is there any standard, clean way to find out whether a link is an external link?
2 -
Is there any way to deal with relative paths?
I get something like
www.mysite.com/../Domain/List3.html
which obviously isn't right.
I can remove the ../ from the link, but that might not work with all links.
Is there any way to find out the full address of a link?

For relative paths, you could take a look at realpath().
Use parse_url() to get the domain, for example, so you can easily check
whether the domain is equal to your own. Notice that parse_url() requires a scheme to be defined,
so you may need to prepend http:// if there is no http[s].
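As a sketch of that suggestion (the host "www.mysite.com" is a placeholder here; substitute your own base host), an external-link check built on parse_url() could look like this:

```php
<?php
// Sketch: treat a link as external when its host differs from our own.
// "www.mysite.com" is an assumed placeholder host for demonstration.
function is_external($href, $my_host = 'www.mysite.com') {
    // parse_url() only recognizes the host part when a scheme is present
    if (!preg_match('#^https?://#i', $href)) {
        if (strpos($href, 'www.') === 0) {
            $href = 'http://' . $href;   // bare "www." link: add a scheme
        } else {
            return false;                // relative path: same site
        }
    }
    $host = parse_url($href, PHP_URL_HOST);
    return strcasecmp($host, $my_host) !== 0;
}

var_dump(is_external('http://www.external.com/page')); // bool(true)
var_dump(is_external('/about/team.html'));             // bool(false)
var_dump(is_external('http://www.mysite.com/faq'));    // bool(false)
```

This avoids the chain of stripos() checks: one host comparison decides, and relative paths are accepted as internal by construction.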

Related

Can we use str_replace on code fetched from a remote URL?

I got source code from a remote URL like this:

$f = file_get_contents("http://www.example.com/abc/");
$str = htmlspecialchars($f);
echo $str;

In that code I want to replace/extract any URL that looks like
href="/m/offers/"
and rewrite that link as
href="www.example.com/m/offers/"
For that I used:

$newstr = str_replace('href="/m/offers/"', 'href="www.example.com/m/offers/"', $str);
echo $newstr;

but this is not replacing anything. Now I want to know: 1) can I replace with str_replace in code fetched from a remote URL? If yes, how? If no, is there any other solution?
There will not be any " in your $str, because htmlspecialchars() will have converted them all to &quot; before the string got to your str_replace().
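To illustrate (the markup below is invented for demonstration; the real page will differ), either do the replacement on the raw HTML before encoding, or search for the &quot;-encoded form:

```php
<?php
// Hypothetical snippet of the remote page, for demonstration only.
$f = '<a href="/m/offers/">Offers</a>';

// Option 1: replace on the raw HTML, then encode for display.
$fixed = str_replace('href="/m/offers/"',
                     'href="http://www.example.com/m/offers/"', $f);
echo htmlspecialchars($fixed), "\n";

// Option 2: search for the &quot;-encoded form in the already-encoded string.
$str = htmlspecialchars($f);
$newstr = str_replace('href=&quot;/m/offers/&quot;',
                      'href=&quot;http://www.example.com/m/offers/&quot;', $str);
echo $newstr, "\n";
```

Option 1 is usually the cleaner choice: manipulate the HTML first, and only encode at the very end when printing.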
I'll start by assuming all href attributes belong to <a> tags.
Since we don't know whether all tags are written in the same way, instead of opting for regular expressions I will use a parser to facilitate the extraction process:
<?php
use Symfony\Component\DomCrawler\Crawler;

$base = "http://www.example.com";
$url = $base . "/abc/";
$html = file_get_contents($url);
$crawler = new Crawler($html);

$links = array();
$raw_links = array();
$offers = array();

foreach ($crawler->filter('a') as $atag) {
    $raw_links[] = $raw_link = $atag->getAttribute('href');
    $links[] = $link = str_replace($base, '', $raw_link);
    if (strpos($link, 'm/offers') !== false) {
        $offers[] = $link;
    }
}
Now you have all the raw links, the relative links, and the offers links.
I use the DomCrawler component.

Adding _blank targets to external links

I just want to know how I can add a _blank target to a link if the link points to an external domain (_self for internal ones). I was doing this by checking the URL, but it was really hard-coded and not reusable for other sites.
Do you have any idea how to do this properly with PHP?
$target_type = (strpos($ref, $_SERVER['HTTP_HOST']) > -1
    || strpos($ref, '/') === 0 ? '_self' : '_blank');
if ($ref <> '#' && substr($ref, 0, 4) <> 'http') $ref = 'http://' . $ref;
$array['href'] = $ref;
if (substr($ref, 0, 1) <> '#') $array['target'] = $target_type;
$array['rel'] = 'nofollow';
if (empty($array['text'])) $array['text'] = str_replace('http://', '', $ref);
This only works for the main domain; when using domain.com/friendlyurl/, it doesn't work.
Thanks in advance.
NOTE: Links may or may not contain the http:// protocol, and they are usually absolute links. Links are added by users in the system.
The easiest way is to use the parse_url() function. For me it would look like this:
<?php
$link = $_GET['link'];
$urlp = parse_url($link);
$target = '_self';
if (isset($urlp['host']) && $urlp['host'] != $_SERVER['HTTP_HOST'])
{
    if (preg_replace('#^(www\.)(.*)#D', '$2', $urlp['host']) != $_SERVER['HTTP_HOST']
        && preg_replace('#^(www\.)(.*)#D', '$2', $_SERVER['HTTP_HOST']) != $urlp['host'])
        $target = '_blank';
}
$anchor = 'LINK';
// creating html code
echo $target . '<br>';
echo '<a href="' . $link . '" target="' . $target . '">' . $anchor . '</a>';
In this code I use the $_GET['link'] variable, but you should use your own link as the value of $link. You can check this script here: http://kolodziej.in/help/link_target.php?link=http://blog.kolodziej.in/2013/06/i-know-jquery-not-javascript/ (returns a _blank link), http://kolodziej.in/help/link_target.php?link=http://kolodziej.in/ (returns a _self link).

How to fetch the RSS feed URL of a website using PHP?

I need to find the RSS feed URL of a website programmatically
[either using PHP or jQuery].
The general process has already been answered (Quentin, DOOManiac), so some code (Demo):
<?php
$location = 'http://hakre.wordpress.com/';
$html = file_get_contents($location);
echo getRSSLocation($html, $location); # http://hakre.wordpress.com/feed/
/**
* #link http://keithdevens.com/weblog/archive/2002/Jun/03/RSSAuto-DiscoveryPHP
*/
function getRSSLocation($html, $location){
    if(!$html or !$location){
        return false;
    }else{
        #search through the HTML, save all <link> tags
        # and store each link's attributes in an associative array
        preg_match_all('/<link\s+(.*?)\s*\/?>/si', $html, $matches);
        $links = $matches[1];
        $final_links = array();
        $link_count = count($links);
        for($n=0; $n<$link_count; $n++){
            $attributes = preg_split('/\s+/s', $links[$n]);
            $final_link = array(); # reset for each <link> tag
            foreach($attributes as $attribute){
                $att = preg_split('/\s*=\s*/s', $attribute, 2);
                if(isset($att[1])){
                    $att[1] = preg_replace('/([\'"]?)(.*)\1/', '$2', $att[1]);
                    $final_link[strtolower($att[0])] = $att[1];
                }
            }
            $final_links[$n] = $final_link;
        }
        #now figure out which one points to the RSS file
        for($n=0; $n<$link_count; $n++){
            $href = false;
            if(strtolower($final_links[$n]['rel']) == 'alternate'){
                if(strtolower($final_links[$n]['type']) == 'application/rss+xml'){
                    $href = $final_links[$n]['href'];
                }
                if(!$href and strtolower($final_links[$n]['type']) == 'text/xml'){
                    #kludge to make the first version of this still work
                    $href = $final_links[$n]['href'];
                }
                if($href){
                    if(strstr($href, "http://") !== false){ #if it's absolute
                        $full_url = $href;
                    }else{ #otherwise, 'absolutize' it
                        $url_parts = parse_url($location);
                        #only made it work for http:// links. Any problem with this?
                        $full_url = "http://$url_parts[host]";
                        if(isset($url_parts['port'])){
                            $full_url .= ":$url_parts[port]";
                        }
                        if($href[0] != '/'){ #it's a relative link on the domain
                            $full_url .= dirname($url_parts['path']);
                            if(substr($full_url, -1) != '/'){
                                #if the last character isn't a '/', add it
                                $full_url .= '/';
                            }
                        }
                        $full_url .= $href;
                    }
                    return $full_url;
                }
            }
        }
        return false;
    }
}
See: RSS auto-discovery with PHP (archived copy).
This is something a lot more involved than just pasting some code here. But I can point you in the right direction for what you need to do.
First you need to fetch the page
Parse the string you get back, looking for the RSS autodiscovery <link> tag. You could map the whole document out as XML and use DOM traversal, but I would just use a regular expression.
Extract the href portion of the tag and you now have the URL to the RSS feed.
The rules for making RSS discoverable are fairly well documented. You just need to parse the HTML and look for the elements described.
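For completeness, a DOM-based variant of that lookup might look like the following (a sketch: it only checks the two common alternate types, and the sample markup at the bottom is invented):

```php
<?php
// Find the first RSS/Atom autodiscovery <link> in a page (sketch).
function findFeedHref($html) {
    $doc = new DOMDocument();
    @$doc->loadHTML($html);  // @ silences warnings on sloppy real-world markup
    foreach ($doc->getElementsByTagName('link') as $link) {
        $rel  = strtolower($link->getAttribute('rel'));
        $type = strtolower($link->getAttribute('type'));
        if ($rel === 'alternate'
            && in_array($type, array('application/rss+xml', 'application/atom+xml'))) {
            return $link->getAttribute('href');
        }
    }
    return false;
}

$html = '<html><head><link rel="alternate" type="application/rss+xml"'
      . ' title="Feed" href="http://example.com/feed/"></head><body></body></html>';
echo findFeedHref($html); // http://example.com/feed/
```

Unlike a regex, this tolerates attribute reordering, single quotes, and extra whitespace in the <link> tag.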
A slightly smaller function that will grab the first available feed, whether it is RSS or Atom (most blogs offer both options; this grabs the first preference):
public function getFeedUrl($url){
    if ($html = @file_get_contents($url)) {
        preg_match_all('/<link\srel\=\"alternate\"\stype\=\"application\/(?:rss|atom)\+xml\"\stitle\=\".*href\=\"(.*)\"\s\/\>/', $html, $matches);
        if (isset($matches[1][0])) {
            return $matches[1][0];
        }
    }
    return false;
}

PHP: How to get base URL from HTML page

I'm struggling with figuring out how to do this. I have an absolute URL to an HTML page, and I need to get the base URL for this. So the URLs could be for example:
http://www.example.com/
https://www.example.com/foo/
http://www.example.com/foo/bar.html
https://alice#www.example.com/foo
And so on. So, the first problem is to find the base URL from those and other URLs. The second problem is that some HTML pages contain a base tag, which could be for example http://example.com/ or simply / (although I think some browsers only support the one starting with protocol://?).
Either way, how can I do this in PHP correctly? I have the URL, and I have the HTML loaded up in a DOMDocument, so I should be able to grab the base tag fairly easily if it exists. How do browsers solve this, for example?
Clarification on why I need this
I'm trying to create something which takes a URL to a web page and returns the absolute URL to all the images this web page links to. Since some/many/all of these images might have relative URLs, I need to find the base URL to use when I make them absolute. This might be the base URL of the web page, or it might be a base URL specified in the HTML itself.
I have managed to fetch the HTML and find the URLs. I think I've also found a working method of making the URLs absolute when I have the base URL to use. But finding the base URL is what I'm missing, and what I'm asking about here.
See parse_url().
$result=parse_url('http://www.google.com');
print_r($result);
Pick out of there whichever element you are looking for. You probably want $result['path'].
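Building on that, here is a sketch that derives a base URL from the parse_url() components and prefers a <base href> tag when the page declares one (the helper name page_base_url is made up, and it ignores user-info, port, and query for brevity):

```php
<?php
// Derive a base URL for resolving relative links (sketch).
function page_base_url($page_url, $html) {
    // a <base href> in the page wins, matching how browsers resolve links
    $doc = new DOMDocument();
    @$doc->loadHTML($html);
    $bases = $doc->getElementsByTagName('base');
    if ($bases->length > 0 && $bases->item(0)->getAttribute('href') !== '') {
        return $bases->item(0)->getAttribute('href');
    }
    // otherwise: scheme + host + the directory part of the path
    $p = parse_url($page_url);
    $dir = isset($p['path']) ? preg_replace('#/[^/]*$#', '/', $p['path']) : '/';
    return $p['scheme'] . '://' . $p['host'] . $dir;
}

echo page_base_url('http://www.example.com/foo/bar.html', '<html></html>'), "\n";
// http://www.example.com/foo/
echo page_base_url('http://x.test/a.html',
    '<html><head><base href="http://cdn.test/assets/"></head></html>'), "\n";
// http://cdn.test/assets/
```

A relative image URL can then be made absolute by concatenating it onto the returned base (with extra care for hrefs that start with "/", which are relative to the host rather than the directory).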
Fun with snippets!
if (!function_exists('base_url')) {
    function base_url($atRoot = FALSE, $atCore = FALSE, $parse = FALSE) {
        if (isset($_SERVER['HTTP_HOST'])) {
            $http = isset($_SERVER['HTTPS']) && strtolower($_SERVER['HTTPS']) !== 'off' ? 'https' : 'http';
            $hostname = $_SERVER['HTTP_HOST'];
            $dir = str_replace(basename($_SERVER['SCRIPT_NAME']), '', $_SERVER['SCRIPT_NAME']);
            $core = preg_split('#/#', str_replace($_SERVER['DOCUMENT_ROOT'], '', realpath(dirname(__FILE__))), -1, PREG_SPLIT_NO_EMPTY);
            $core = $core[0];
            $tmplt = $atRoot ? ($atCore ? "%s://%s/%s/" : "%s://%s/") : ($atCore ? "%s://%s/%s/" : "%s://%s%s");
            $end = $atRoot ? ($atCore ? $core : $hostname) : ($atCore ? $core : $dir);
            $base_url = sprintf($tmplt, $http, $hostname, $end);
        }
        else $base_url = 'http://localhost/';
        if ($parse) {
            $base_url = parse_url($base_url);
            if (isset($base_url['path']) && $base_url['path'] == '/') $base_url['path'] = '';
        }
        return $base_url;
    }
}
Use as simple as:
// url like: http://stackoverflow.com/questions/2820723/how-to-get-base-url-with-php
echo base_url(); // will produce something like: http://stackoverflow.com/questions/2820723/
echo base_url(TRUE); // will produce something like: http://stackoverflow.com/
echo base_url(TRUE, TRUE); // or echo base_url(NULL, TRUE); will produce something like: http://stackoverflow.com/questions/
// and finally
echo base_url(NULL, NULL, TRUE);
// will produce something like:
// array(3) {
// ["scheme"]=>
// string(4) "http"
// ["host"]=>
// string(17) "stackoverflow.com"
// ["path"]=>
// string(35) "/questions/2820723/"
// }

1 bug to kill... Letting PHP Generate The Canonical

For building a clean canonical URL that always returns one base URL, I'm stuck on the following case:
<?php
# every page
$extensions = $_SERVER['REQUEST_URI']; # path like: /en/home.ast?ln=ja
$qsIndex = strpos($extensions, '?'); # removes the ?ln=de part
$pageclean = $qsIndex !== FALSE ? substr($extensions, 0, $qsIndex) : $extensions;
$canonical = "http://website.com" . $pageclean; # basic canonical url
?>
<html><head><link rel="canonical" href="<?=$canonical?>"></head>
when URL : http://website.com/de/home.ext?ln=de
canonical: http://website.com/de/home.ext
BUT I want to remove the file extension as well, whether it's .php, .ext, .inc, or whatever two- or three-character extension .[xx] or .[xxx], so the base URL becomes: http://website.com/en/home
Aaah, much nicer! But how do I achieve that in the current code?
Any hints are much appreciated!
I think this should do it: just strip off the end if there is an extension, like you did for the query string:
$pageclean = $qsIndex !== FALSE ? substr($extensions, 0, $qsIndex) : $extensions;
$dotIndex = strrpos($pageclean, '.');
$pagecleanNoExt = $dotIndex !== FALSE ? substr($pageclean, 0, $dotIndex) : $pageclean;
$canonical = "http://website.com" . $pagecleanNoExt; # basic canonical url
Try this:
preg_match("/(.*)\.([^\?]{2,3})(\?(.*)){0,1}$/msiU", $_SERVER['REQUEST_URI'], $res);
$canonical = "http://website.com" . $res[1];
where $res[1] => the clean URL;
$res[2] => the extension;
$res[4] => everything after the "?" (if present and if you need it)
