I'm working in PHP and I have created a function that gets links from a submitted URL.
The code is working fine, but it is picking up even links that are not active, like mailto:, #, javascript:void(0).
How can I avoid picking up a tags whose href is like href="mailto:", href="tel:", or href="javascript:"?
Thank you in advance.
function check_all_links($url) {
$doc = new DOMDocument();
@$doc->loadHTML(file_get_contents($url));
$linklist = $doc->getElementsByTagName("a");
$title = $doc->getElementsByTagName("title");
$href = array();
$page_url = $full_url = $new_url = "";
$full_url = goodUrl($url);
$scheme = parse_url($url, PHP_URL_SCHEME);
$slash = '/';
$links = array();
$linkNo = array();
if ($scheme == "http") {
foreach ($linklist as $link) {
$href = strtolower($link->getAttribute('href'));
$page_url = parse_url($href, PHP_URL_PATH);
$new_url = $scheme."://".$full_url.'/'.ltrim($page_url, '/');
//check if href has mailto: or # or javascript: or tel:
if (strpos($page_url, "tel:") === True) {
continue;
}
if(!in_array($new_url, $linkNo)) {
echo $new_url."<br>" ;
array_push($linkNo, $new_url);
$links[] = array('Links' => $new_url );
}
}
}else if ($scheme == "https") {
foreach ($linklist as $link) {
$href = strtolower($link->getAttribute('href'));
$page_url = parse_url($href, PHP_URL_PATH);
$new_url = $scheme."://".$full_url.'/'.ltrim($page_url, '/');
if (strpos($page_url, "tel:") === True) {
continue;
}
if(!in_array($new_url, $linkNo)) {
echo $new_url."<br>" ;
array_push($linkNo, $new_url);
$links[] = array('Links' => $new_url );
}
}
}
return $links;
}
You can use the scheme field from the parse_url() result. Note that you need to call parse_url() on the href without a component flag so that it returns an array (stored in, say, $parts below).
Instead of:
if (strpos($page_url, "tel:") === True) {
continue;
}
you can use:
$parts = parse_url($href);
if (isset($parts["scheme"]) && in_array($parts["scheme"], ["mailto", "tel", "javascript"])) {
    continue;
}
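In context, the loop body becomes something like this minimal sketch (your variable names are kept; only the scheme check is new):
foreach ($linklist as $link) {
    $href = strtolower($link->getAttribute('href'));
    // skip non-navigational schemes before building the absolute URL
    $parts = parse_url($href);
    if (isset($parts["scheme"]) && in_array($parts["scheme"], ["mailto", "tel", "javascript"])) {
        continue;
    }
    $page_url = parse_url($href, PHP_URL_PATH);
    $new_url = $scheme . "://" . $full_url . '/' . ltrim($page_url, '/');
    if (!in_array($new_url, $linkNo)) {
        echo $new_url . "<br>";
        array_push($linkNo, $new_url);
        $links[] = array('Links' => $new_url);
    }
}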
So I have a web crawler that I am working on, and I have a CSV file with about one million websites that I want to pass in to be crawled. My problem is that I am able to save the CSV file in an array, but when I pass it to the method that crawls, it seems that it takes the first element and crawls it, not the whole array. Can someone help me?
<?php
include("classes/DomDocumentParser.php");
include("config.php");
$alreadyCrawled = array();
$crawling = array();
$alreadyFoundImages = array();
$my_list = array();
function linkExists($url){
global $con;
$query = $con->prepare("SELECT * FROM sites WHERE url = :url");
$query ->bindParam(":url",$url);
$query->execute();
return $query->rowCount() != 0;
}
function insertImage($url,$src,$title,$alt){
global $con;
$query = $con->prepare("INSERT INTO images(siteUrl, imageUrl, alt, title)
VALUES(:siteUrl,:imageUrl,:alt,:title)");
$query ->bindParam(":siteUrl",$url);
$query ->bindParam(":imageUrl",$src);
$query ->bindParam(":alt",$alt);
$query ->bindParam(":title",$title);
return $query->execute();
}
function insertLink($url,$title,$description,$keywords){
global $con;
$query = $con->prepare("INSERT INTO sites(Url, title, description, keywords)
VALUES(:url,:title,:description,:keywords)");
$query ->bindParam(":url",$url);
$query ->bindParam(":title",$title);
$query ->bindParam(":description",$description);
$query ->bindParam(":keywords",$keywords);
return $query->execute();
}
function createLink($src,$url){
$scheme = parse_url($url)["scheme"]; // http or https
$host = parse_url($url)["host"]; // www.mohamad-ahmad.com
if(substr($src,0,2) =="//"){
// //www.mohanadahmad.com
$src = $scheme . ":" . $src;
}
else if(substr($src,0,1) =="/"){
// /aboutus/about.php
$src = $scheme . "://" . $host . $src;
}
else if(substr($src,0,2) =="./"){
// ./aboutus/about.php
$src = $scheme . "://" . $host . dirname(parse_url($url)["path"]) . substr($src ,1);
}
else if(substr($src,0,3) =="../"){
// ../aboutus/about.php
$src = $scheme . "://" . $host . "/" . $src;
}
else if(substr($src,0,5) !="https" && substr($src,0,4) !="http" ){
// aboutus/about.php
$src = $scheme . "://" . $host ."/" .$src;
}
return $src;
}
function getDetails($url){
global $alreadyFoundImages;
$parser = new DomDocumentParser($url);
$titleArray = $parser->getTitletags();
if(sizeof($titleArray) == 0 || $titleArray->item(0) == NULL){
return;
}
$title = $titleArray -> item(0) -> nodeValue;
$title = str_replace("\n","",$title);
if($title == ""){
return;
}
$description="";
$keywords="";
$metasArray = $parser -> getMetatags();
foreach($metasArray as $meta){
if($meta->getAttribute("name") == "description"){
$description = $meta -> getAttribute("content");
}
if($meta->getAttribute("name") == "keywords"){
$keywords = $meta -> getAttribute("content");
}
}
$description = str_replace("\n","",$description);
$keywords = str_replace("\n","",$keywords);
if(linkExists($url)){
echo "$url already exists <br>";
}
else if(insertLink($url,$title,$description,$keywords)){
echo "SUCCESS: $url <br>";
}
else{
echo "ERROR: Failed to insert $url <br>";
}
$imageArray = $parser ->getImages();
foreach($imageArray as $image){
$src = $image->getAttribute("src");
$alt = $image->getAttribute("alt");
$title = $image->getAttribute("title");
if(!$title && !$alt){
continue;
}
$src = createLink($src,$url);
if(!in_array($src,$alreadyFoundImages)){
$alreadyFoundImages[] = $src;
insertImage($url,$src,$alt,$title);
}
}
}
function followLinks($url) {
global $crawling;
global $alreadyCrawled;
$parser = new DomDocumentParser($url);
$linkList = $parser->getLinks();
foreach($linkList as $link){
$href = $link->getAttribute("href");
if(strpos($href,"#") !==false){
// Ignore anchor url
continue;
}
else if(substr($href,0,11)== "javascript:"){
// Ignore javascript url
continue;
}
$href = createLink($href,$url);
if(!in_array($href,$alreadyCrawled)){
$alreadyCrawled[] = $href;
$crawling[] = $href;
//getDetails contain the insert into db
getDetails($href);
}
}
array_shift($crawling);
foreach($crawling as $site){
followLinks($site);
}
}
function fill_my_list(){
global $my_list;
$file = fopen('top-1m.csv', 'r');
while( ($data = fgetcsv($file)) !== false ) {
$startUrl = "https://www.".$data[1];
$my_list[] = $startUrl;
}
foreach($my_list as $key => $u){
followLinks($u);
}
}
fill_my_list();
?>
You can do something like this, following the example from php.net:
$row = 1;
if (($File = fopen("test.csv", "r")) !== FALSE) {
while (($data = fgetcsv($File, 1000, ",")) !== FALSE) {
$num = count($data);
echo "<p> $num fields in line $row: <br /></p>\n";
$row++;
for ($c=0; $c < $num; $c++) {
echo $data[$c] . "<br />\n";
}
//Use $data[$c];
}
fclose($File);
}
More examples here: https://www.php.net/manual/en/function.fgetcsv.php#refsect1-function.fgetcsv-examples
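Applied to your list, a minimal sketch of the same pattern (assuming the top-1m.csv layout you already use, with the rank in column 0 and the domain in column 1; followLinks is your function from above):
$my_list = array();
if (($file = fopen("top-1m.csv", "r")) !== FALSE) {
    while (($data = fgetcsv($file)) !== FALSE) {
        // column 1 holds the bare domain, so prepend the scheme and www
        $my_list[] = "https://www." . $data[1];
    }
    fclose($file);
}
// the whole list is read and the file closed before any crawling starts
foreach ($my_list as $startUrl) {
    followLinks($startUrl);
}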
I'm using a small script to convert absolute links to relative ones. It is working, but it needs improvement. Not sure how to proceed. Please have a look at the part of the script used for this.
Script:
public function links($path) {
$old_url = 'http://test.dev/';
$dir_handle = opendir($path);
while($item = readdir($dir_handle)) {
$new_path = $path."/".$item;
if(is_dir($new_path) && $item != '.' && $item != '..') {
$this->links($new_path);
}
// it is a file
else{
if($item != '.' && $item != '..')
{
$new_url = '';
$depth_count = 1;
$folder_depth = substr_count($new_path, '/');
while($depth_count < $folder_depth){
$new_url .= '../';
$depth_count++;
}
$file_contents = file_get_contents($new_path);
$doc = new DOMDocument;
@$doc->loadHTML($file_contents);
foreach ($doc->getElementsByTagName('a') as $link) {
if (substr($link, -1) == "/"){
$link->setAttribute('href', $link->getAttribute('href').'/index.html');
}
}
$doc->saveHTML();
$file_contents = str_replace($old_url,$new_url,$file_contents);
file_put_contents($new_path,$file_contents);
}
}
}
}
As you can see, I've added that DOMDocument block inside the while loop, but it doesn't work. What I'm trying to achieve here is to append index.html to every link whose last character is /.
What am I doing wrong?
Thank you.
Is this what you want?
$file_contents = file_get_contents($new_path);
$dom = new DOMDocument();
$dom->loadHTML($file_contents);
$xpath = new DOMXPath($dom);
$links = $xpath->query("//a");
foreach ($links as $link) {
$href = $link->getAttribute('href');
if (substr($href, -1) === '/') {
$link->setAttribute('href', $href."index.html");
}
}
$new_file_content = $dom->saveHTML();
# save this wherever you want
See a demo on ideone.com.
Hint: Your call to $dom->saveHTML() leads nowhere (i.e. there's no variable capturing the output).
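For completeness, a minimal sketch of that fix, writing the captured output back to the file as your original script does ($new_path comes from the surrounding loop in your script):
// capture the serialized DOM and overwrite the source file with it
$new_file_content = $dom->saveHTML();
file_put_contents($new_path, $new_file_content);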
LinkedIn
No link... uses the default location:
http://www.linkedin.com/favicon.ico
Twitter
<link href="/phoenix/favicon.ico" rel="shortcut icon" type="image/x-icon" />
Pinterest
<link rel="icon" href="http://passets-cdn.pinterest.com/images/favicon.png" type="image/x-icon" />
Facebook
<link rel="shortcut icon" href="https://s-static.ak.facebook.com/rsrc.php/yi/r/q9U99v3_saj.ico" />
I've determined that the only 100% reliable way to find a favicon is to check the source and see where the link is.
The default location is not always used; note the first two examples.
The Google API works only about 85% of the time.
Is there a function that can parse this info out? Or is there a good strategy for using a regex to pull it out manually?
I will be parsing the HTML server-side to get this info.
Ideas:
Regex example (seems too easy... but here is a starting point):
<link\srel="[Ss]hortcut [Ii]con"\shref="(.+)"(.+)>
Use a parser:
$dom = new DOMDocument();
@$dom->loadHTML($input);
$links = $dom->getElementsByTagName('link');
$l = $links->length;
$favicon = "/favicon.ico";
for( $i=0; $i<$l; $i++) {
$item = $links->item($i);
if( strcasecmp($item->getAttribute("rel"),"shortcut icon") === 0) {
$favicon = $item->getAttribute("href");
break;
}
}
// You now have your $favicon
Alternative to PHP 5 DOMDocument: raw regex
This works for all cases so far.
$pattern = '#<link\s+(?=[^>]*rel="(?:shortcut\s)?icon"\s+)(?:[^>]*href="(.+?)").*>#i';
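A minimal usage sketch for that pattern ($html is assumed to already hold the fetched page source):
// $m[1] captures the href value when a matching <link> tag is found
if (preg_match($pattern, $html, $m)) {
    $favicon = $m[1];
} else {
    $favicon = "/favicon.ico"; // fall back to the default location
}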
You will have to work around several issues, like site redirects and various caveats. Here is what I did to harvest something like 90% of my website feeds' favicons:
<?php
/*
nws-favicon : Get site's favicon using various strategies
This script is part of NWS
https://github.com/xaccrocheur/nws/
*/
function CheckImageExists($imgUrl) {
    if (@GetImageSize($imgUrl)) {
        return true;
    } else {
        return false;
    }
}
function getFavicon ($url) {
$fallback_favicon = "/var/www/favicon.ico";
$dom = new DOMDocument();
@$dom->loadHTML(@file_get_contents($url)); // loadHTML() expects markup, so fetch the page first
$links = $dom->getElementsByTagName('link');
$l = $links->length;
$favicon = "/favicon.ico";
for( $i=0; $i<$l; $i++) {
$item = $links->item($i);
if( strcasecmp($item->getAttribute("rel"),"shortcut icon") === 0) {
$favicon = $item->getAttribute("href");
break;
}
}
$u = parse_url($url);
$subs = explode( '.', $u['host']);
$domain = $subs[count($subs) -2].'.'.$subs[count($subs) -1];
$file = "http://".$domain."/favicon.ico";
$file_headers = @get_headers($file);
if($file_headers[0] == 'HTTP/1.1 404 Not Found' || $file_headers[0] == 'HTTP/1.1 404 NOT FOUND' || $file_headers[0] == 'HTTP/1.1 301 Moved Permanently') {
$fileContent = #file_get_contents("http://".$domain);
$dom = new DOMDocument();
@$dom->loadHTML($fileContent);
$xpath = new DOMXpath($dom);
$elements = $xpath->query("head/link//@href");
$hrefs = array();
foreach ($elements as $link) {
$hrefs[] = $link->value;
}
$found_favicon = array();
foreach ( $hrefs as $key => $value ) {
if( substr_count($value, 'favicon.ico') > 0 ) {
$found_favicon[] = $value;
$icon_key = $key;
}
}
$found_http = array();
foreach ( $found_favicon as $key => $value ) {
if( substr_count($value, 'http') > 0 ) {
$found_http[] = $value;
$favicon = $hrefs[$icon_key];
$method = "xpath";
} else {
$favicon = $domain.$hrefs[$icon_key];
if (substr($favicon, 0, 4) != 'http') {
$favicon = 'http://' . $favicon;
$method = "xpath+http";
}
}
}
if (isset($favicon)) {
if (!CheckImageExists($favicon)) {
$favicon = $fallback_favicon;
$method = "fallback";
}
} else {
$favicon = $fallback_favicon;
$method = "fallback";
}
} else {
$favicon = $file;
$method = "classic";
if (!CheckImageExists($file)) {
$favicon = $fallback_favicon;
$method = "fallback";
}
}
return $favicon;
}
?>
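Usage is then a single call; a minimal sketch (the URL is only an example):
$icon = getFavicon("http://www.linkedin.com/");
echo $icon; // either a harvested favicon URL or the fallback path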
I have a function which returns links from a given page using a regular expression in PHP.
Now I want to follow each found link, and the links found on those pages, and so on....
Here is the code I have:
function getLinks($url){
$content = file_get_contents($url);
preg_match_all("|<a [^>]+>(.*)</[^>]+>|U", $content, $links, PREG_PATTERN_ORDER);
$l_clean = array();
foreach($links[0] as $link){
$e_link = explode("href",$link);
$e_link = explode("\"",$e_link[1]);
$f_link = $e_link[1];
if( (substr($f_link,0,strlen('javascript:;')) != "javascript:;")){
$sperator = "";
$first = substr($f_link,0,1);
if($first != "/"){
$f_link = "/$f_link";
}
if(substr($f_link,0,7) != "http://"){
$f_link = "http://" . $sperator . $_SERVER['HTTP_HOST'] . $f_link;
}
$f_link = str_replace("///","//",$f_link);
if(!in_array($f_link, $l_clean)){
array_push($l_clean , $f_link);
}
}
}
return $l_clean;
}
Just do it recursively, and set a depth to terminate:
function getLinks($url, $depth){
if( --$depth <= 0 ) return array();
$content = file_get_contents($url);
preg_match_all("|<a [^>]+>(.*)</[^>]+>|U", $content, $links, PREG_PATTERN_ORDER);
$l_clean = array();
foreach($links[0] as $link){
$e_link = explode("href",$link);
$e_link = explode("\"",$e_link[1]);
$f_link = $e_link[1];
if( (substr($f_link,0,strlen('javascript:;')) != "javascript:;")){
$sperator = "";
$first = substr($f_link,0,1);
if($first != "/"){
$f_link = "/$f_link";
}
if(substr($f_link,0,7) != "http://"){
$f_link = "http://" . $sperator . $_SERVER['HTTP_HOST'] . $f_link;
}
$f_link = str_replace("///","//",$f_link);
if(!in_array($f_link, $l_clean)){
array_push($l_clean , $f_link);
// recurse and merge the links found on the followed page
$l_clean = array_merge($l_clean, getLinks( $f_link, $depth ));
}
}
}
return $l_clean;
}
$links = getLinks("http://myurl.com", 3);
I'm trying to add the results of a script to an array, but once I look into it there is only one item in it. Probably me being silly with placement:
function crawl_page($url, $depth)
{
static $seen = array();
$Linklist = array();
if (isset($seen[$url]) || $depth === 0) {
return;
}
$seen[$url] = true;
$dom = new DOMDocument('1.0');
@$dom->loadHTMLFile($url);
$anchors = $dom->getElementsByTagName('a');
foreach ($anchors as $element) {
$href = $element->getAttribute('href');
if (0 !== strpos($href, 'http')) {
$href = rtrim($url, '/') . '/' . ltrim($href, '/');
}
if(shouldScrape($href)==true)
{
crawl_page($href, $depth - 1);
}
}
echo "URL:",$url;
echo http_response($url);
echo "<br/>";
$Linklist[] = $url;
$XML = new DOMDocument('1.0');
$XML->formatOutput = true;
$root = $XML->createElement('Links');
$root = $XML->appendChild($root);
foreach ($Linklist as $value)
{
$child = $XML->createElement('Linkdetails');
$child = $root->appendChild($child);
$text = $XML->createTextNode($value);
$text = $child->appendChild($text);
}
$XML->save("linkList.xml");
}
$Linklist[] = $url; will add a single item to the $Linklist array. This line needs to be in a loop, I think.
static $Linklist = array(); I think, but the code is awful.
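A minimal sketch of that suggestion (only the declaration changes; the rest of crawl_page stays as posted):
function crawl_page($url, $depth)
{
    static $seen = array();
    // static: $Linklist now persists across the recursive calls,
    // so every crawled URL accumulates in the same array
    static $Linklist = array();
    if (isset($seen[$url]) || $depth === 0) {
        return;
    }
    $seen[$url] = true;
    // ... crawl, recurse, and append to $Linklist as before ...
}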